Introduction



“It is important to prove, but it is more important to improve.”



This textbook is an introduction to Scientific Computing. We will 
illustrate several numerical methods for the computer solution of cer- 
tain classes of mathematical problems that cannot be faced by paper 
and pencil. We will show how to compute the zeros or the integrals 
of continuous functions, solve linear systems, approximate functions by 
polynomials and construct accurate approximations for the solution of 
differential equations. 

With this aim, in Chapter 1 we will illustrate the rules of the game 
that computers adopt when storing and operating with real and complex 
numbers, vectors and matrices. 

In order to make our presentation concrete and appealing we will adopt 
the programming environment MATLAB® (see footnote 1) as a faithful companion.
We will gradually discover its principal commands, statements and con- 
structs. We will show how to execute all the algorithms that we intro- 
duce throughout the book. This will enable us to furnish an immediate 
quantitative assessment of their theoretical properties such as stability, 
accuracy and complexity. We will solve several problems that will be 
raised through exercises and examples, often stemming from specific ap- 
plications. 



1 MATLAB is a trademark of The MathWorks Inc., 24 Prime Park Way, Natick,
MA 01760, Tel: 001+508-647-7000, Fax: 001+508-647-7001. 







Several graphical devices will be adopted in order to make the reading
more pleasant. We will report in the margin the MATLAB command
alongside the line where that command is introduced for the first
time. Dedicated marginal symbols will indicate the presence of
exercises and the presence of a MATLAB program, while a further symbol
will be used when we want to draw the reader's attention to a critical
or surprising behavior of an algorithm or a procedure. The mathematical
formulae of special relevance are put within a frame. Finally, another
symbol indicates the presence of a display panel summarizing concepts
and conclusions which have just been reported and drawn.

At the end of each chapter a specific section is devoted to mentioning
those subjects which have not been addressed and to indicating the
bibliographical references for a more comprehensive treatment of the
material that we have covered.

Quite often we will refer to the textbook [QSS00] where many issues
faced in this book are treated at a deeper level, and where theoretical
results are proven. For a more thorough description of MATLAB we refer
to [HH00]. All the programs introduced in this text can be downloaded
from the web address

mox.polimi.it/Springer.



No special prerequisite is demanded of the reader, with the exception 
of an elementary course of Calculus. 

However, in the course of the first chapter, we recall the principal
results of Calculus and Geometry that will be used extensively throughout
this text. The less elementary subjects, those which are not so
necessary for an introductory educational path, are highlighted by a
special symbol.



We express our thanks to Thanh-Ha Le Thi from Springer-Verlag
Heidelberg, and to Francesca Bonadei and Marina Forlizzi from
Springer-Italia for their friendly collaboration throughout this
project. We gratefully thank Prof. Eastham of Cardiff University for
editing the language of the whole manuscript and stimulating us to
clarify many points of our text.
Alfio Quarteroni, Fausto Saleri



1 What can’t be ignored



In this book we will systematically use elementary mathematical con- 
cepts which the reader should know already, yet he or she might not 
recall them immediately. 

We will therefore use this chapter to refresh them, and to introduce 
new concepts as well which pertain to the field of Numerical Analysis. We 
will begin to explore their meaning and usefulness with the help of MAT- 
LAB (MATrix LABoratory), an integrated environment for programming 
and visualization in scientific computing. In Section 1.6 we will give a 
quick introduction to MATLAB, which is sufficient for the use that we 
are going to make in this book. However, we refer the interested readers
to the manual [HH00] for a complete description of this language.

In the present Chapter we have therefore condensed notions which 
are typical of courses in Calculus, Linear Algebra and Geometry, yet 
rephrasing them in a way that is suitable for use in scientific computing. 



1.1 Real numbers 



While the set R of real numbers is known to everyone, the way in which 
computers treat them is perhaps less well known. On one hand, since 
machines have limited resources, only a subset F of finite dimension of R 
can be represented. These are called floating-point numbers. On the other 
hand, as we shall see in Section 1.1.2, F is characterized by properties 
that are different from those of R . The reason is that any real number 
x is in principle truncated by the machine, giving rise to a new number
(called the floating-point number), denoted by fl(x), which does not
necessarily coincide with the original number x.
1.1.1 How do we represent them



To become acquainted with the differences between R and F, let us make 
a few experiments using MATLAB which illustrate the way that a com- 
puter (e.g. a PC) deals with real numbers. Whether we use MATLAB 
rather than another language is just a matter of convenience. The re- 
sults of our calculation, indeed, depend primarily on the manner in which 
the computer works, and only to a lesser degree on the programming 
language. Let us consider the rational number x = 1/7, whose decimal
representation is 0.142857142857... (the block 142857 repeats
indefinitely). This is an infinite representation, since the
number of decimal digits is infinite. To get its computer representation,
let us introduce after the prompt the ratio 1/7 and obtain

>> 1/7
ans = 

0.1429 

which is a number with only four decimal digits, the last being different 
from the fourth digit of the original number. Should we now consider 1/3 
we would find 0.3333, so the fourth decimal digit would now be exact. 
This behavior is due to the fact that real numbers are rounded on the 
computer. This means, first of all, that only a fixed number of decimal 
digits are returned, and moreover the last decimal digit which appears is 
increased by unity whenever the first disregarded decimal digit is greater 
than or equal to 5. 

The first remark to make is that using only four decimal digits to
represent real numbers is questionable. Indeed, the internal
representation of the number is made with as many as 16 decimal digits,
and what we have seen is simply one of several possible MATLAB output
formats. The same number can take different expressions depending upon
the specific format declaration that is made. For instance, for the
number 1/7, some possible output formats are:

format long      yields   0.14285714285714,
format short e   yields   1.4286e-01,
format long e    yields   1.428571428571428e-01,
format short g   yields   0.14286,
format long g    yields   0.142857142857143.



Some of them are more coherent than others with the internal
computer representation. As a matter of fact, in general a computer
stores a real number in the following way

$$x = (-1)^s \cdot (0.a_1 a_2 \ldots a_t) \cdot \beta^e = (-1)^s \cdot m \cdot \beta^{e-t}, \qquad a_1 \neq 0,$$   (1.1)

where s is either 0 or 1, $\beta$ (a positive integer larger than or
equal to 2) is the basis adopted by the specific computer at hand, m is
an integer called the mantissa whose length t is the maximum number of
digits $a_i$ (ranging between 0 and $\beta - 1$) that are stored, and e
is an integral number called the exponent. The format long e is the one
which most resembles this representation, and e stands for exponent; its
digits, preceded by the sign, are reported to the right of the character
e. The numbers whose form is given in (1.1) are called floating-point
numbers, since the position of the decimal point is not fixed. The
digits $a_1 a_2 \ldots a_p$ (with $p \le t$) are often called the first
p significant digits of x.

The condition $a_1 \neq 0$ ensures that a number cannot have multiple
representations. For instance, without this restriction the number 1/10
could be represented (in the decimal basis) as $0.1 \cdot 10^0$, but
also as $0.01 \cdot 10^1$, etc.

The set F is therefore completely characterized by the basis $\beta$,
the number of significant digits t and the range (L, U) (with L < 0 and
U > 0) of variation of the index e. Thus it is denoted as
F($\beta$, t, L, U). For instance, in MATLAB we have
F = F(2, 53, -1021, 1024) (indeed, 53 significant digits in basis 2
correspond to the 15 significant digits that are shown by MATLAB in
basis 10 with the format long).

Fortunately, the roundoff error that is inevitably generated whenever
a real number $x \neq 0$ is replaced by its representative fl(x) in F,
is small, since

$$\frac{|x - fl(x)|}{|x|} \le \frac{1}{2}\,\epsilon_M$$   (1.2)

where $\epsilon_M = \beta^{1-t}$ provides the distance between 1 and
the closest floating-point number greater than 1. Note that
$\epsilon_M$ depends on $\beta$ and t. For instance, in MATLAB
$\epsilon_M$ can be obtained through the command eps, and we obtain
$\epsilon_M = 2^{-52} \simeq 2.22 \cdot 10^{-16}$. Let us point out that
in (1.2) we estimate the relative error on x, which is undoubtedly more
meaningful than the absolute error $|x - fl(x)|$. As a matter of fact,
the latter doesn’t account for the order of magnitude of x whereas the
former does.
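
The role of eps can be checked directly at the prompt. The following lines are a minimal sketch, assuming standard double-precision arithmetic; the variable names epsM, next_after_one and halfway are ours and purely illustrative.

```matlab
% Machine epsilon in double precision (beta = 2, t = 53)
epsM = eps;                       % built-in constant, equal to 2^(-52)

% 1 + epsM is the first floating-point number larger than 1 ...
next_after_one = 1 + epsM;

% ... whereas 1 + epsM/2 lies between two floating-point numbers
% and is rounded back to 1
halfway = 1 + epsM/2;

% The spacing between floating-point numbers grows with their magnitude:
% eps(x) returns the distance from abs(x) to the next larger number
spacing_near_1000 = eps(1000);    % 2^(-43), much larger than epsM

fprintf('epsM = %g, spacing near 1000 = %g\n', epsM, spacing_near_1000);
```

The last line illustrates why (1.2) is stated for the relative error: the absolute spacing changes with the size of x, while the relative spacing stays of the order of eps.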

Number 0 does not belong to F, as in that case we would have $a_1 = 0$
in (1.1): it is therefore handled separately. Moreover, L and U being
finite, one cannot represent numbers whose absolute value is either
arbitrarily large or arbitrarily small. Precisely, the smallest and the
largest positive real numbers of F are given respectively by

$$x_{min} = \beta^{L-1}, \qquad x_{max} = \beta^{U}(1 - \beta^{-t}).$$

In MATLAB these values can be obtained through the commands
realmin and realmax, yielding

$$x_{min} = 2.225073858507201 \cdot 10^{-308}, \qquad x_{max} = 1.7976931348623158 \cdot 10^{+308}.$$



A positive number smaller than $x_{min}$ produces a message of
underflow and is treated either as 0 or in a special way (see, e.g.,
[QSS00], Chapter 2). A positive number greater than $x_{max}$ yields
instead a message of overflow and is stored in the variable Inf (which
is the computer representation of $+\infty$).

The elements in F are more dense near $x_{min}$, and less dense while
approaching $x_{max}$. As a matter of fact, the number in F nearest to
$x_{max}$ (to its left) and the one nearest to $x_{min}$ (to its right)
are, respectively

$$x_{max}^{-} = 1.7976931348623157 \cdot 10^{+308}, \qquad x_{min}^{+} = 2.225073858507202 \cdot 10^{-308}.$$

Thus $x_{min}^{+} - x_{min} \simeq 10^{-323}$, while
$x_{max} - x_{max}^{-} \simeq 10^{292}$ (!). However,
the relative distance is small in both cases, as we can infer from (1.2).
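
These distances can be inspected with the one-argument form of eps. What follows is only a sketch under the assumption of IEEE double precision; the names gap_min, gap_max, rel_min, rel_max and overflowed are introduced here for illustration.

```matlab
xmin = realmin;                 % smallest positive normalized number
xmax = realmax;                 % largest finite number

% Absolute distance to the neighbouring floating-point number
gap_min = eps(xmin);            % of the order of 1e-323
gap_max = eps(xmax);            % of the order of 1e+292

% Relative distances: both are of the order of machine epsilon
rel_min = gap_min / xmin;
rel_max = gap_max / xmax;

% Exceeding xmax produces Inf
overflowed = 2 * xmax;

fprintf('gap near xmin: %g, gap near xmax: %g\n', gap_min, gap_max);
fprintf('relative gaps: %g and %g\n', rel_min, rel_max);
```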



1.1.2 How do we operate with floating-point numbers 



Since F is a proper subset of R, elementary algebraic operations on
floating-point numbers do not enjoy all the properties of analogous
operations on R. Precisely, commutativity still holds for addition (that
is fl(x + y) = fl(y + x)) as well as for multiplication
(fl(xy) = fl(yx)), but other properties such as associativity and
distributivity are violated. Moreover, 0 is no longer unique. Indeed,
let us assign the variable a the value 1, and execute the following
instructions:

>> a = 1; b = 1; while a+b ~= a; b = b/2; end

The variable b is halved at every step as long as the sum of a and b
remains different (~=) from a. Should we operate on real numbers, this
program would never end, whereas in our case it ends after a finite
number of steps and returns the following value for b:
1.1102e-16 = $\epsilon_M/2$. There exists therefore at least one number
b different from 0 such that a+b=a. This is possible since F is made up
of isolated numbers; when adding two numbers a and b with b sufficiently
small with respect to a (for a = 1, b not larger than $\epsilon_M/2$),
the sum a+b is rounded back to a.
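
The same experiment can be run as a short script that also counts the halvings. This is a sketch of the loop above with one extra counter, k, which is our own addition.

```matlab
a = 1;
b = 1;
k = 0;                      % number of halvings performed (illustrative counter)
while a + b ~= a
    b = b / 2;
    k = k + 1;
end

% In double precision the loop stops when b has reached eps/2
fprintf('b = %g after %d halvings (eps/2 = %g)\n', b, k, eps/2);
```

In double precision the loop stops as soon as b equals 2^(-53), that is after 53 halvings, because 1 + 2^(-53) is rounded to 1.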

Associativity is violated whenever a situation of overflow or underflow
occurs. Take for instance a=1.0e+308, b=1.1e+308 and c=-1.001e+308,
and carry out the sum in two different ways. We find that

a + (b + c) = 1.0990e+308,     (a + b) + c = Inf.

This is a particular instance of what occurs when one adds two numbers
with opposite sign but similar absolute value. In this case the result
may be quite inexact and the situation is referred to as loss, or
cancellation, of significant digits. For instance, let us compute
((1 + x) - 1)/x (the obvious result being 1 for any $x \neq 0$):

>> x = 1.e-15; ((1+x)-1)/x
ans =

    1.1102

This result is rather imprecise, the relative error being larger than 11%!
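
A short script makes both effects visible. It is a sketch only: the second value of x, a power of 2, is our own choice, added to show that the error comes from the rounding of 1 + x and not from the subtraction itself.

```matlab
% Associativity lost because of overflow
a = 1.0e+308;  b = 1.1e+308;  c = -1.001e+308;
s1 = a + (b + c);           % finite, about 1.0990e+308
s2 = (a + b) + c;           % a + b overflows, so the result is Inf

% Cancellation: 1 + x is rounded before 1 is subtracted
x  = 1.e-15;
r1 = ((1 + x) - 1) / x;     % about 1.1102 instead of 1

% With x a power of 2, 1 + x is stored exactly and no digit is lost
xp = 2^(-50);
r2 = ((1 + xp) - 1) / xp;   % exactly 1

fprintf('s1 = %g, s2 = %g, r1 = %.4f, r2 = %g\n', s1, s2, r1, r2);
```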

Another case of numerical cancellation is encountered while evaluating
the function

$$f(x) = x^7 - 7x^6 + 21x^5 - 35x^4 + 35x^3 - 21x^2 + 7x - 1$$   (1.3)

at 401 equispaced points with abscissa in
$[1 - 2 \cdot 10^{-8}, 1 + 2 \cdot 10^{-8}]$. We obtain the chaotic
graph reported in Figure 1.1 (the real behavior is that of $(x - 1)^7$,
which is substantially constant and equal to the null function in such a
tiny neighborhood of x = 1). In Section 1.4 we will see the commands
that have generated the graph.




Fig. 1.1. Oscillatory behavior of the function (1.3) caused by cancellation 
errors 
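
One possible way to reproduce this experiment is sketched below. It is not necessarily the sequence of commands used for Figure 1.1: it relies on polyval and linspace, and the names c, y_expanded and y_factored are ours.

```matlab
% 401 equispaced points around x = 1
x = linspace(1 - 2e-8, 1 + 2e-8, 401);

% Coefficients of (1.3), from the highest power down to the constant term
c = [1 -7 21 -35 35 -21 7 -1];

y_expanded = polyval(c, x);     % affected by cancellation
y_factored = (x - 1).^7;        % same function, no cancellation

fprintf('max |expanded| = %g, max |factored| = %g\n', ...
        max(abs(y_expanded)), max(abs(y_factored)));

% To look at the oscillations one may add, for instance:
% plot(x, y_expanded); grid on
```

The factored form stays below about 1e-54 in absolute value, while the expanded form returns values many orders of magnitude larger, which is pure rounding noise.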

Finally, it is interesting to notice that in F there is no place for
indeterminate forms such as 0/0 or $\infty/\infty$, whose presence
produces what is called not a number (NaN in MATLAB) for which the
normal rules of calculus do not apply.

Remark 1.1 Whereas it is true that roundoff errors are usually small, when
repeated within long and complex algorithms, they may give rise to
catastrophic effects. Two outstanding cases concern the explosion of the
Ariane rocket on June 4, 1996, engendered by an overflow in the computer
on board, and the failure of an American Patriot missile to intercept an
Iraqi Scud, which fell on an American barracks during the Gulf War in
1991, because of a roundoff error in the computation of its trajectory.

An example with less catastrophic (but still troublesome) consequences is
provided by the sequence

$$z_2 = 2, \qquad z_{n+1} = 2^{n-1/2}\sqrt{1 - \sqrt{1 - 4^{1-n} z_n^2}}, \qquad n = 2, 3, \ldots$$   (1.4)

which converges to $\pi$ when n tends to infinity. When MATLAB is used to
compute $z_n$, the relative error found between $\pi$ and $z_n$ decreases
for the first 16 iterations, then grows because of roundoff errors (as
shown in Figure 1.2).

Fig. 1.2. Logarithm of the relative error $|\pi - z_n|/\pi$ versus n

See Exercises 1.1-1.2.
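
The behavior described in Remark 1.1 for the sequence (1.4) can be observed with a few lines. This is a sketch under the assumption that thirty terms are enough to see both regimes; nmax, relerr, best and best_n are illustrative names.

```matlab
nmax = 30;
z = zeros(1, nmax);          % z(n) stores z_n; z(1) is not used
z(2) = 2;
for n = 2:nmax-1
    z(n+1) = 2^(n - 0.5) * sqrt(1 - sqrt(1 - 4^(1-n) * z(n)^2));
end

% Relative error with respect to pi, for n = 2, ..., nmax
relerr = abs(pi - z(2:nmax)) / pi;
[best, idx] = min(relerr);
best_n = idx + 1;            % index n at which the error is smallest

fprintf('smallest relative error %g reached at n = %d\n', best, best_n);
fprintf('relative error at n = %d: %g\n', nmax, relerr(end));

% A picture similar to Figure 1.2 could be drawn with:
% semilogy(2:nmax, relerr)
```

The inner difference 1 - sqrt(1 - ...) is a subtraction of nearly equal numbers, so the growth of the error for large n is another instance of cancellation.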



1.2 Complex numbers 



Complex numbers, whose set is denoted by C, have the form z = x + iy
where $i = \sqrt{-1}$ is the imaginary unit (that is $i^2 = -1$), while
x = Re(z) and y = Im(z) are the real and imaginary part of z,
respectively. They are generally represented on the computer as pairs of
real numbers.

Unless redefined otherwise, MATLAB variables i as well as j denote
the imaginary unit. To introduce a complex number of real part x and
imaginary part y, one can just write x+i*y; as an alternative, one can
use the command complex(x,y). Let us also mention the trigonometric
(or polar) representation of a complex number z,

$$z = \rho e^{i\theta} = \rho(\cos\theta + i\sin\theta),$$   (1.5)

where $\rho = \sqrt{x^2 + y^2}$ is the absolute value of the complex
number (it can be obtained by setting abs(z)) and $\theta$ the argument,
that is the angle between the vector z of components (x, y), and the
x-axis. $\theta$ can be found by typing angle(z). The representation
(1.5) is therefore:

abs(z) * (cos(angle(z)) + i * sin(angle(z))).
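
As a quick check of (1.5), the following sketch rebuilds a complex number from its modulus and argument. The sample value 3 + 3i and the names rho, theta and w are our own; the literal 1i is used instead of i so that the result does not depend on whether the variable i has been redefined.

```matlab
z = 3 + 3i;

rho   = abs(z);              % modulus, here 3*sqrt(2)
theta = angle(z);            % argument, here pi/4

% Polar form (1.5) rebuilt from rho and theta
w = rho * (cos(theta) + 1i * sin(theta));

fprintf('rho = %g, theta = %g, |w - z| = %g\n', rho, theta, abs(w - z));
```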

The polar representation of one or more complex numbers can be
obtained through the command compass(z) where z is either a single
complex number or a vector whose components are complex numbers.

For instance, by typing

>> z = 3+i*3; compass(z);

one obtains the graph reported in Figure 1.3. 





Fig. 1.3. Output of the MATLAB command compass 

For any given complex number z, one can extract its real part with
the command real(z) and its imaginary part with imag(z). Finally, the
complex conjugate $\bar{z} = x - iy$ of z can be obtained by simply
writing conj(z).

In MATLAB all operations are carried out by implicitly assuming that
the operands as well as the result are complex. We may therefore find
some apparently surprising results. For instance, if we compute the cube
root of -5 with the MATLAB command (-5)^(1/3), instead of -1.7099...
we obtain the complex number 0.8550 + 1.4809i. (We anticipate the use
of the symbol ^ for the power exponent.) As a matter of fact, all
numbers of the form $\rho e^{i(\theta + 2k\pi)}$, with k an integer, are
indistinguishable from z. By computing $\sqrt[3]{z}$ we find
$\sqrt[3]{\rho}\, e^{i(\theta/3 + 2k\pi/3)}$, that is, the three
distinct roots

$$z_1 = \sqrt[3]{\rho}\, e^{i\theta/3}, \qquad z_2 = \sqrt[3]{\rho}\, e^{i(\theta/3 + 2\pi/3)}, \qquad z_3 = \sqrt[3]{\rho}\, e^{i(\theta/3 + 4\pi/3)}.$$

MATLAB will select the one that is encountered by spanning the complex
plane counterclockwise beginning from the real axis. Since the polar
representation of z = -5 is $\rho e^{i\theta}$ with $\rho = 5$ and
$\theta = -\pi$, the three roots are (see Fig. 1.4 for their
representation in the Gauss plane)

$$z_1 = \sqrt[3]{5}\,(\cos(-\pi/3) + i\sin(-\pi/3)) \simeq 0.8550 - 1.4809i,$$
$$z_2 = \sqrt[3]{5}\,(\cos(\pi/3) + i\sin(\pi/3)) \simeq 0.8550 + 1.4809i,$$
$$z_3 = \sqrt[3]{5}\,(\cos(-\pi) + i\sin(-\pi)) \simeq -1.71.$$

The second root is the one which is selected.
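
When the real cube root is the one wanted, a different command is needed. The lines below are a sketch using nthroot and roots, two standard MATLAB functions that this section has not introduced; the variable names are illustrative.

```matlab
% Power operator: returns the complex root z_2 discussed above
principal = (-5)^(1/3);

% Real cube root of a negative real number
real_root = nthroot(-5, 3);

% All three cube roots of -5, as the zeros of the polynomial z^3 + 5
all_roots = roots([1 0 0 5]);

disp(principal);
disp(real_root);
disp(all_roots);
```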




Fig. 1.4. Representation in the complex plane of the three cube roots of the 
real number —5 

Finally, by (1.5) we obtain the Euler formula:

$$\cos(\theta) = \frac{1}{2}\left(e^{i\theta} + e^{-i\theta}\right), \qquad \sin(\theta) = \frac{1}{2i}\left(e^{i\theta} - e^{-i\theta}\right).$$   (1.6)



1.3 Matrices 



Let n and m be positive integers. A matrix with m rows and n columns
is a set of $m \times n$ elements $a_{ij}$, with i = 1, ..., m,
j = 1, ..., n, represented by the following table:

$$A = \begin{bmatrix} a_{11} & a_{12} & \ldots & a_{1n} \\ a_{21} & a_{22} & \ldots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{m1} & a_{m2} & \ldots & a_{mn} \end{bmatrix}.$$   (1.7)

In compact form we write $A = (a_{ij})$. Should the elements of A be real
numbers, we write $A \in \mathbb{R}^{m \times n}$, and
$A \in \mathbb{C}^{m \times n}$ if they are complex.

Square matrices of dimension n are those with m = n. A matrix
featuring a single column is a column vector, whereas a matrix featuring
a single row is a row vector.

In order to introduce a matrix in MATLAB one has to write the
elements from the first to the last row, introducing the character ; to
separate the different rows. For instance, the command

>> A=[1 2 3; 4 5 6]

produces

A =
     1     2     3
     4     5     6

that is, a 2 x 3 matrix whose elements are indicated above. The null
matrix 0 is that with null elements $a_{ij} = 0$ for i = 1, ..., m,
j = 1, ..., n; it can be constructed by the MATLAB command zeros(m,n).
The MATLAB command eye(m,n) produces a rectangular matrix whose elements
are all 0 except those on the main diagonal which are unity.

The main diagonal of an m x n matrix A is the diagonal made of
elements $a_{ii}$, i = 1, ..., min(m, n).

A particular case is the command eye(n) (which is a shorthand version
for eye(n,n)); it produces a square matrix of dimension n which is called
the identity matrix and is denoted by I.

We can define the following operations:

1. if $A = (a_{ij})$ and $B = (b_{ij})$ are m x n matrices, the sum of A
and B is the matrix $A + B = (a_{ij} + b_{ij})$;

2. the product of a matrix A by a real or complex number $\lambda$ is the
matrix $\lambda A = (\lambda a_{ij})$;

3. the product of two matrices is possible only for compatible sizes,
precisely if A is m x p and B is p x n, for some positive integer p.
In that case C = AB is an m x n matrix whose elements are

$$c_{ij} = \sum_{k=1}^{p} a_{ik} b_{kj}, \qquad \text{for } i = 1, \ldots, m, \ j = 1, \ldots, n.$$



Here is an example of the sum and product of two matrices.

>> A=[1 2 3; 4 5 6]; B=[7 8 9; 10 11 12]; C=[13 14; 15 16; 17 18];
>> A+B
ans =
     8    10    12
    14    16    18
>> A*C
ans =
    94   100
   229   244

Note that MATLAB returns a diagnostic message when one tries to carry 
out operations on matrices with incompatible dimensions. For instance: 

>> A+C 

??? Error using ==> + 

Matrix dimensions must agree. 

>> A*B

??? Error using ==> * 

Inner matrix dimensions must agree. 
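
To connect the operator * with the definition of the matrix product, the sum over k can be written out explicitly. The triple loop below is a didactic sketch, not the way MATLAB computes products internally; the names P, p and p2 are ours.

```matlab
A = [1 2 3; 4 5 6];
C = [13 14; 15 16; 17 18];

[m, p]  = size(A);
[p2, n] = size(C);
if p ~= p2
    error('Inner matrix dimensions must agree.');
end

% c_ij = sum over k of a_ik * b_kj, written out element by element
P = zeros(m, n);
for i = 1:m
    for j = 1:n
        for k = 1:p
            P(i,j) = P(i,j) + A(i,k) * C(k,j);
        end
    end
end

disp(P);                     % same matrix as A*C
```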



Remark 1.2 For any given matrix A, the MATLAB command dim=size(A)
returns the two-element variable dim=[m,n] containing the number of rows
and columns of A. More generally, the command whos allows one to check the
type of all variables in use in a work session. For instance, if we have
defined only the matrices A, B and C as before, the command whos returns

  Name      Size      Bytes  Class
  A         2x3          48  double array
  B         2x3          48  double array
  C         3x2          48  double array

Grand total is 18 elements using 144 bytes



If A is a square matrix of dimension n, its inverse (provided it exists)
is a square matrix of dimension n, denoted by $A^{-1}$, which satisfies
the matrix relation $A A^{-1} = A^{-1} A = I$. We can obtain $A^{-1}$
through the command inv(A). The inverse of A exists iff the determinant
of A, a number denoted by det(A), is non-zero. The latter condition is
satisfied in turn iff the column vectors of A are linearly independent
(see Section 1.3.1). The determinant of a square matrix is defined by
the following recursion formula (Laplace rule):

$$\det(A) = \begin{cases} a_{11} & \text{if } n = 1, \\[4pt] \displaystyle\sum_{j=1}^{n} \Delta_{ij}\, a_{ij}, & \text{for } n > 1, \ \forall i = 1, \ldots, n, \end{cases}$$   (1.8)

where $\Delta_{ij} = (-1)^{i+j} \det(A_{ij})$ and $A_{ij}$ is the matrix
obtained by eliminating the i-th row and j-th column from matrix A. In
particular, if $A \in \mathbb{R}^{1 \times 1}$ we set
$\det(A) = a_{11}$; if $A \in \mathbb{R}^{2 \times 2}$ one has

$$\det(A) = a_{11} a_{22} - a_{12} a_{21};$$

if $A \in \mathbb{R}^{3 \times 3}$ we obtain

$$\det(A) = a_{11} a_{22} a_{33} + a_{31} a_{12} a_{23} + a_{21} a_{13} a_{32} - a_{11} a_{23} a_{32} - a_{21} a_{12} a_{33} - a_{31} a_{13} a_{22}.$$

For the matrix product we have the following property: if A = BC,
then $\det(A) = \det(B) \cdot \det(C)$.

To invert a 2 x 2 matrix and compute its determinant we can proceed 
as follows: 

>> A=[1 2; 3 4];
>> inv(A)
ans =
   -2.0000    1.0000
    1.5000   -0.5000
>> det(A) 
ans = 

-2 
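
The properties recalled above can be verified numerically. The following is a minimal sketch; the second matrix C2 is an arbitrary example of ours, and small tolerances should be expected in place of exact equalities since inv and det work in floating-point arithmetic.

```matlab
A = [1 2; 3 4];

% Determinant of a 2 x 2 matrix from the explicit formula
d = A(1,1)*A(2,2) - A(1,2)*A(2,1);     % -2

% A times its inverse should be (numerically) the identity
residual = norm(A*inv(A) - eye(2));

% det(B*C) = det(B)*det(C), checked on an arbitrary second matrix
C2 = [0 1; 2 5];
lhs = det(A*C2);
rhs = det(A)*det(C2);

fprintf('d = %g, det(A) = %g, residual = %g\n', d, det(A), residual);
fprintf('det(A*C2) = %g, det(A)*det(C2) = %g\n', lhs, rhs);
```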

Should a matrix be singular, MATLAB returns a diagnostic message, 
followed by a matrix whose elements are all equal to Inf, as illustrated 
by the following example: 

>> A=[1 2; 0 0];
>> inv(A)
Warning: Matrix is singular to working precision.
ans =

Inf Inf 

Inf Inf 

For special classes of square matrices, the computation of inverses and
determinants is rather simple. In particular, if A is a diagonal matrix,
i.e. one for which only the diagonal elements $a_{kk}$, k = 1, ..., n,
are non-zero, its determinant is given by
$\det(A) = a_{11} a_{22} \cdots a_{nn}$. In particular, A is
non-singular iff $a_{kk} \neq 0$ for all k. In such a case the inverse
of A is still a diagonal matrix with elements $a_{kk}^{-1}$.

Let v be a vector of dimension n. The MATLAB command diag(v)
produces a diagonal matrix whose elements are the components of vector
v. The more general command diag(v,m) yields a square matrix of
dimension n+abs(m) whose m-th upper diagonal (i.e. the diagonal made
of elements of indices i, i + m) has elements equal to the components
of v, while the remaining elements are null. Note that this extension is
valid also when m is negative, in which case the only affected elements
are those of lower diagonals.

For instance if v = [1 2 3] then:

>> A=diag(v,-1)
A =
     0     0     0     0
     1     0     0     0
     0     2     0     0
     0     0     3     0

Other special matrices are the upper triangular and lower triangular
matrices. A square matrix of dimension n is lower (respectively, upper)
triangular if all elements above (respectively, below) the main diagonal
are zero. Its determinant is simply the product of the diagonal elements.

Through the commands tril(A) and triu(A), one can extract from
the matrix A of dimension n its lower and upper triangular part. Their
extensions tril(A,m) or triu(A,m), with m ranging from -n to n,
permit the extraction of the triangular part augmented by, or deprived
of, m extradiagonals.

For instance, given the matrix A = [3 1 2; -1 3 4; -2 -1 3], by the
command L1=tril(A) we obtain

L1 =
     3     0     0
    -1     3     0
    -2    -1     3

while, by L2=tril(A,-1), we obtain

L2 =
     0     0     0
    -1     0     0
    -2    -1     0

Finally, we recall that if $A \in \mathbb{R}^{m \times n}$ its transpose
$A^T \in \mathbb{R}^{n \times m}$ is the matrix obtained by interchanging
rows and columns of A. When $A = A^T$ the matrix A is called symmetric.
The MATLAB notation for the transpose of A is A'.

1.3.1 Vectors



Vectors will be indicated in boldface; precisely, v will denote a column
vector whose i-th component is denoted by $v_i$. When all components are
real numbers we can write $\mathbf{v} \in \mathbb{R}^n$.

In MATLAB, vectors are regarded as particular cases of matrices. To
introduce a column vector one has to insert between square brackets the
values of its components separated by semi-colons, whereas for a row
vector it suffices to write the component values separated by blanks or
commas. For instance, through the instructions v = [1;2;3] and
w = [1 2 3] we initialize the column vector v and the row vector w, both
of dimension 3. The command zeros(n,1) (respectively, zeros(1,n))
produces a column (respectively, row) vector of dimension n with null
elements, which we will denote by 0. Similarly, the command ones(n,1)
generates the column vector, denoted with 1, whose components are all
equal to 1. Finally, with the command v=[] we initialize an empty
vector.

A system of vectors $\{\mathbf{y}_1, \ldots, \mathbf{y}_m\}$ is called
linearly independent if the relation

$$\alpha_1 \mathbf{y}_1 + \ldots + \alpha_m \mathbf{y}_m = \mathbf{0}$$

implies that all coefficients $\alpha_1, \ldots, \alpha_m$ are null. A
system $B = \{\mathbf{y}_1, \ldots, \mathbf{y}_n\}$ of n linearly
independent vectors in $\mathbb{R}^n$ (or $\mathbb{C}^n$) is a basis for
$\mathbb{R}^n$ (or $\mathbb{C}^n$), that is, any vector w in
$\mathbb{R}^n$ can be written as

$$\mathbf{w} = \sum_{k=1}^{n} w_k \mathbf{y}_k,$$

for a unique possible choice of the coefficients $\{w_k\}$. The latter
are called the components of w with respect to the basis B. For
instance, the canonical basis of $\mathbb{R}^n$ is the set of vectors
$\{\mathbf{e}_1, \ldots, \mathbf{e}_n\}$, where $\mathbf{e}_i$ has its
i-th component equal to 1, and all other components equal to 0. Although
not a unique basis of $\mathbb{R}^n$, the latter is the one which is
normally used.
The scalar product of two vectors $\mathbf{v}, \mathbf{w} \in \mathbb{R}^n$
is defined as

$$(\mathbf{v}, \mathbf{w}) = \mathbf{w}^T \mathbf{v} = \sum_{k=1}^{n} v_k w_k,$$

$\{v_k\}$ and $\{w_k\}$ being the components of v and w, respectively.
The corresponding MATLAB command is w'*v, where now the apostrophe
denotes transposition of a vector. The length (or modulus) of a vector v
is given by

$$\|\mathbf{v}\| = \sqrt{(\mathbf{v}, \mathbf{v})} = \sqrt{\sum_{k=1}^{n} v_k^2}$$

and can be computed through the command norm(v).

The MATLAB command x.*y or x.^2 indicates that these operations
should be carried out component by component. For instance if we define
the vectors

>> v = [1; 2; 3]; w = [4; 5; 6];

the instruction

>> w'*v
ans =
    32

provides their scalar product, while

>> w.*v
ans =
     4
    10
    18

returns a vector whose i-th component is equal to $w_i v_i$.
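
The scalar product and the length of a vector can be obtained in several equivalent ways. The sketch below compares them; dot is a standard MATLAB function not introduced in this section, and the names s1, s2, s3, len_v and len_check are illustrative.

```matlab
v = [1; 2; 3];
w = [4; 5; 6];

s1 = w' * v;               % scalar product via transposition
s2 = sum(w .* v);          % componentwise product, then sum
s3 = dot(v, w);            % built-in scalar product

len_v     = norm(v);       % length of v
len_check = sqrt(v' * v);  % same quantity from the definition

fprintf('scalar product: %g %g %g\n', s1, s2, s3);
fprintf('length of v: %g (check: %g)\n', len_v, len_check);
```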

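Finally, we recall that a number $\lambda$ (real or complex) is an
eigenvalue of the matrix $A \in \mathbb{R}^{n \times n}$, if

$$A\mathbf{v} = \lambda \mathbf{v},$$

for a suitable $\mathbf{v} \in \mathbb{C}^n$, $\mathbf{v} \neq \mathbf{0}$,
which is called an eigenvector associated with $\lambda$.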

In general, the computation of eigenvalues is quite difficult. Exceptions 
are represented by diagonal and triangular matrices, whose eigenvalues 
are their diagonal elements. 
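
The statement about triangular matrices can be checked with the command eig, which this chapter has not yet introduced. The matrix T below is an arbitrary example of ours; this is only a sketch.

```matlab
% Upper triangular matrix with distinct diagonal entries
T = [2 1 0; 0 5 3; 0 0 -1];

lambda = eig(T);                 % eigenvalues computed numerically

% They coincide, up to ordering and rounding, with the diagonal of T
difference = norm(sort(lambda) - sort(diag(T)));

disp(sort(lambda)');
fprintf('difference from the diagonal entries: %g\n', difference);
```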

See Exercises 1.3-1.5.



1.4 Real functions 



This chapter will deal with real functions. In particular, given a function
f which is defined on an interval (a, b), we would like to compute its
zeros, its integral and its derivative, as well as to determine its behavior.

The command fplot(fun,lims) plots the graph of the function fun
(which is stored as a string of characters) on the interval
(lims(1),lims(2)). For instance, to represent $f(x) = 1/(1 + x^2)$ on
(-5, 5), we can write

>> fun = '1/(1+x^2)'; lims=[-5,5]; fplot(fun,lims);

or, more directly,

>> fplot('1/(1+x^2)',[-5 5]);

The graph is obtained by sampling the function on a set of
non-equispaced abscissae and reproduces the true graph of f with a
tolerance of 0.2%. To improve the accuracy we could use the command

>> fplot(fun,lims,tol,n,'LineSpec',P1,P2,...)

where tol indicates the desired tolerance and the parameter n ($\ge 1$)
ensures that the function will be plotted with a minimum of n + 1 points.
'LineSpec' is a given line specification (for instance, '--' for a dashed
line, ':' for a dotted line), while the parameters P1,P2,... can be
passed directly to the function fun. To use default values for tol, n or
'LineSpec' one can pass empty matrices ([ ]).

To evaluate a function fun at a point x we write y=eval(fun), after
having initialized x. The corresponding value is stored in y. Note that x,
and correspondingly y, can be a vector. When using this command, the 
restriction is that the argument of the function fun must be x. When 
the argument of fun has a different name (this is often the case when 
this argument is generated at the interior of a program) the command 
eval would be replaced by feval (see Remark 1.4). 
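
When x is a vector, the string must use the componentwise operators ./ and .^ for the evaluation to make sense. The sketch below shows this, together with an anonymous function as an alternative that does not depend on the name of the variable; anonymous functions (the @ syntax) are a feature of more recent MATLAB versions than the one this text describes, and the names f, y and y2 are ours.

```matlab
% String-based evaluation: the variable must be called x
fun = '1./(1+x.^2)';
x = [-5 0 5];
y = eval(fun);

% Alternative: an anonymous function, whose argument name is free
f = @(t) 1 ./ (1 + t.^2);
y2 = f(x);

disp(y);
disp(y2);
```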

Finally, we point out that if we write grid on after the command
fplot, we can obtain the background-grid as that in Figure 1.1.

Remark 1.3 In general circumstances, it might be useful to allow any
number of arguments to a function. This is made possible by using the
variable varargin, a cell array containing the optional arguments to the
function. varargin must be declared as the last input argument and
collects all the inputs from that point onwards.

For instance, the program (see Section 1.6.2)

function L=mytril(varargin)
L=tril(varargin{:});

collects all the inputs into the variable varargin. mytril uses the
comma-separated list syntax varargin{:} to pass the parameters to the
function tril. The call,

L=mytril([1 2 3; 4 5 6; 7 8 9], -2)
L =
     0     0     0
     0     0     0
     7     0     0

results in varargin being a 1-by-2 cell array containing the values
[1 2 3; 4 5 6; 7 8 9] and -2.



1.4.1 The zeros 



We recall that $\alpha$ is said to be a zero of a real function f if
$f(\alpha) = 0$. It is simple if $f'(\alpha) \neq 0$, multiple otherwise.

From the graph of a function one can infer (within a certain
tolerance) which are its real zeros. The direct computation of all zeros
of a given function is not always straightforward. For functions which
are polynomials with real coefficients of degree n, that is, of the form

$$p_n(x) = a_0 + a_1 x + a_2 x^2 + \ldots + a_n x^n = \sum_{k=0}^{n} a_k x^k, \qquad a_k \in \mathbb{R}, \ a_n \neq 0,$$

we can obtain the only zero $\alpha = -a_0/a_1$, when n = 1 (i.e. $p_1$
represents a straight line), or the two zeros, $\alpha_+$ and
$\alpha_-$, when n = 2 (this time $p_2$ represents a parabola)

$$\alpha_{\pm} = \frac{-a_1 \pm \sqrt{a_1^2 - 4 a_0 a_2}}{2 a_2}.$$

However, there are no explicit formulae for the zeros of an arbitrary
polynomial $p_n$ when $n \ge 5$.

Also the number of zeros of a function cannot in general be determined
a priori. An exception is provided by polynomials, for which the number
of zeros (real or complex, counted with their multiplicity) coincides
with the polynomial degree. Moreover, should $\alpha = x + iy$ be a zero
of a polynomial with real coefficients, its complex conjugate
$\bar{\alpha} = x - iy$ is also a zero.
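
For n = 2 the explicit formula can be compared with the command roots, which returns all the zeros of a polynomial given its coefficients. The parabola below is an arbitrary example of ours with complex conjugate zeros; this is a sketch, and the variable names are illustrative.

```matlab
% p_2(x) = a0 + a1*x + a2*x^2 with negative discriminant
a0 = 5;  a1 = 2;  a2 = 1;

disc = a1^2 - 4*a0*a2;                       % -16
alpha_plus  = (-a1 + sqrt(disc)) / (2*a2);   % -1 + 2i
alpha_minus = (-a1 - sqrt(disc)) / (2*a2);   % -1 - 2i

% roots expects the coefficients from the highest power down
r = roots([a2 a1 a0]);

disp([alpha_plus; alpha_minus]);
disp(r);
```

The two zeros are complex conjugates of each other, as expected for a polynomial with real coefficients.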

To compute in MATLAB one zero of a function fun, near a given value x0,
either real or complex, the command fzero(fun,x0) can be used. The result
is an approximate value of the desired zero, and also the interval in
which the search was made. Alternatively, using the command
fzero(fun,[x0 x1]), a zero of fun is searched for in the interval whose
extremes are x0, x1, provided $f$ changes sign between x0 and x1.

Let us consider, for instance, the function $f(x) = x^2 - 1 + e^x$.
Looking at its graph we see that there are two zeros in $(-1, 1)$. To
compute them we need to execute the following commands:

>> fun='x^2 - 1 + exp(x)';
>> fzero(fun,1)
Zero found in the interval: [-0.28, 1.9051].
ans =
  6.0953e-18
>> fzero(fun,-1)
Zero found in the interval: [-1.2263, -0.68].
ans =
   -0.7146
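
The same two zeros can be located with the bracketing form of fzero. The
sketch below is not part of the original session: it assumes a MATLAB
release in which the function is passed as an anonymous function rather
than as a string, and the brackets [-1,-0.5] and [-0.5,1] are chosen here
by inspecting the signs of f.

```matlab
% f(x) = x^2 - 1 + exp(x), written as an anonymous function
f = @(x) x.^2 - 1 + exp(x);

% f(-1) > 0, f(-0.5) < 0 and f(1) > 0, so each bracket contains a sign change
z_left  = fzero(f, [-1, -0.5]);
z_right = fzero(f, [-0.5, 1]);

fprintf('zeros: %.4f and %.4f\n', z_left, z_right);
fprintf('residuals: %.2e and %.2e\n', f(z_left), f(z_right));
```

Supplying a bracket rather than a single starting value removes the
dependence on where the search happens to begin, at the price of having
to know a sign change in advance.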

In Chapter 2 we will introduce and investigate several methods for the 
approximate computation of the zeros of an arbitrary function. 



1.4.2 Polynomials 



Polynomials are very special functions and there is a special MATLAB
toolbox, polyfun, for their treatment. The command polyval evaluates a
polynomial at one or several points. Its input arguments are a vector p
and a vector x, where the components of p are the polynomial coefficients
stored in decreasing order, from $a_n$ down to $a_0$, and the components
of x are the abscissae where the polynomial needs to be evaluated. The
result can be stored in a vector y by writing

>> y = polyval(p,x)

For instance, the values of $p(x) = x^7 + 3x^2 - 1$ at the equispaced
abscissae $x_k = -1 + k/4$ for $k = 0, \ldots, 8$, can be obtained by
proceeding as follows:

>> p = [1 0 0 0 0 3 0 -1]; x = [-1:0.25:1];
>> y = polyval(p,x)
y =
  Columns 1 through 7
    1.0000    0.5540   -0.2578   -0.8126   -1.0000   -0.8124   -0.2422
  Columns 8 through 9
    0.8210    3.0000

Alternatively, one could use the command fplot. However, in that case
one should provide the entire analytic expression of the polynomial in
the input string, and not simply its coefficients.

Let us recall that if $\alpha$ is such that $p(\alpha) = 0$, then
$\alpha$ is called a zero of $p$ or, equivalently, a root of the
algebraic equation $p(x) = 0$. The program roots provides an
approximation of the zeros of a polynomial and requires only the input
of the vector p.

For instance, we can compute the zeros of $p(x) = x^3 - 6x^2 + 11x - 6$
by writing

>> p = [1 -6 11 -6]; format long;
>> roots(p)
ans =
   3.00000000000000
   2.00000000000000
   1.00000000000000

Unfortunately, the result is not always that accurate. For instance, for
the polynomial $p(x) = (x - 1)^7$, whose unique zero is $\alpha = 1$, we
find (quite surprisingly)

>> p = [1 -7 21 -35 35 -21 7 -1]; 

>> roots(p) 
ans = 

1.0088 

1.0055 + 0.0069i 
1.0055 - 0.0069i 
0.9980 + 0.0085i 
0.9980 - 0.0085i 
0.9921 + 0.0038i 
0.9921 - 0.0038i 



This inaccuracy is due to the fact that the coefficients of p have
alternating signs, yielding severe cancellation errors.
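
The loss of accuracy can be measured directly. The following sketch is an
illustration added here, not a listing from the original text: it builds
the coefficients of $(x - 1)^7$ by repeated products and compares the
largest error of roots with the one obtained for the polynomial with
simple zeros 1, 2, 3 used above. The variable names are arbitrary.

```matlab
% Coefficients of (x-1)^7, built by repeated products with (x-1)
p = 1;
for k = 1:7
    p = conv(p, [1 -1]);
end

z = roots(p);                 % computed zeros (exact zero: 1, multiplicity 7)
err_multiple = max(abs(z - 1));

% For comparison: a polynomial with the simple zeros 1, 2, 3
err_simple = max(abs(sort(roots([1 -6 11 -6])) - [1; 2; 3]));

fprintf('largest error, multiple zero: %.1e\n', err_multiple);
fprintf('largest error, simple zeros : %.1e\n', err_simple);
```

The exact figures depend on the machine and on the MATLAB version, but
the error for the multiple zero is expected to be many orders of
magnitude larger than the one for the simple zeros.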
We mention that using the command p=conv(p1,p2) we can obtain the
coefficients of the polynomial given by the product of two polynomials
whose coefficients are contained in the vectors p1 and p2. Similarly,
the command [q,r]=deconv(p1,p2) provides the coefficients of the
polynomials obtained on dividing p1 by p2, i.e. p1 = conv(p2,q) + r. In
other words, q and r are the quotient and the remainder of the division.

Let us consider for instance the product and the ratio between the two
polynomials $p_1(x) = x^4 - 1$ and $p_2(x) = x^3 - 1$:

>> p1 = [1 0 0 0 -1];
>> p2 = [1 0 0 -1];
>> p=conv(p1,p2)
p =
     1     0     0    -1    -1     0     0     1
>> [q,r]=deconv(p1,p2)
q =
     1     0
r =
     0     0     0     1    -1

We therefore find the polynomials $p(x) = p_1(x)p_2(x) = x^7 - x^4 - x^3 + 1$,
$q(x) = x$ and $r(x) = x - 1$ such that $p_1(x) = q(x)p_2(x) + r(x)$.
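
The identity $p_1 = q\,p_2 + r$ can be checked numerically. The sketch
below is an added illustration with arbitrary variable names (p1_rebuilt,
gap); it relies on deconv returning a remainder r of the same length as
p1, as in the output shown above.

```matlab
p1 = [1 0 0 0 -1];          % x^4 - 1
p2 = [1 0 0 -1];            % x^3 - 1
[q, r] = deconv(p1, p2);

% Rebuild p1 from quotient and remainder (r has the same length as p1)
p1_rebuilt = conv(p2, q) + r;

% Compare p1(x) with q(x)*p2(x) + r(x) at a few sample points
x   = linspace(-2, 2, 9);
gap = max(abs(polyval(p1, x) - (polyval(q, x).*polyval(p2, x) + polyval(r, x))));
```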

Finally, the commands polyint(p) and polyder(p) provide, respectively,
the coefficients of the primitive (vanishing at x = 0) and those of the
derivative of the polynomial whose coefficients are given by the
components of the vector p.

If x is a vector of abscissae and p (respectively, p1 and p2) is a
vector containing the coefficients of a polynomial $P$ (respectively,
$P_1$ and $P_2$), the previous commands are summarized in the following
table:

command                 yields
y=polyval(p,x)          y = values of $P(x)$
z=roots(p)              z = roots of $P$ such that $P(z) = 0$
p=conv(p1,p2)           p = coefficients of the polynomial $P_1 P_2$
[q,r]=deconv(p1,p2)     q = coefficients of $Q$, r = coefficients of $R$
                        such that $P_1 = Q P_2 + R$
y=polyder(p)            y = coefficients of $P'(x)$
y=polyint(p)            y = coefficients of $\int P(x)\,dx$

A further command, polyfit, allows the computation of the $n + 1$
polynomial coefficients of a polynomial $P$ of degree $n$ once the values
attained by $P$ at $n + 1$ distinct nodes are available (see Section
3.1.1).
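
As a brief illustration of the last three commands, the following sketch
applies polyder, polyint and polyfit to the cubic
$x^3 - 6x^2 + 11x - 6$ considered earlier. The interval [1,3] and the
nodes 0, 1, 2, 4 are arbitrary choices made here for the example.

```matlab
p = [1 -6 11 -6];            % P(x) = x^3 - 6x^2 + 11x - 6

dp = polyder(p);             % P'(x) = 3x^2 - 12x + 11
ip = polyint(p);             % primitive of P vanishing at x = 0

% Definite integral of P over [1, 3] from the primitive
I = polyval(ip, 3) - polyval(ip, 1);

% Recover the n+1 = 4 coefficients from the values at 4 distinct nodes
nodes  = [0 1 2 4];
values = polyval(p, nodes);
p_fit  = polyfit(nodes, values, 3);
```

Since $P(x) = (x-1)(x-2)(x-3)$ is antisymmetric about $x = 2$, the
integral over [1,3] should vanish up to roundoff, and p_fit should
reproduce p up to roundoff.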



1.4.3 Integration and differentiation 



The following two results will often be invoked throughout this book.

1. the fundamental theorem of integration: if $f$ is a continuous
function in $[a, b)$, then

$$F(x) = \int_a^x f(t)\,dt$$

is a differentiable function, called a primitive of $f$, which
satisfies, $\forall x \in [a, b)$,

$$F'(x) = f(x);$$

2. the first mean-value theorem for integrals: if $f$ is a continuous
function in $[a, b)$ and $x_1, x_2 \in [a, b)$, then
$\exists \xi \in (x_1, x_2)$ such that

$$f(\xi) = \frac{1}{x_2 - x_1} \int_{x_1}^{x_2} f(t)\,dt.$$
Even when it does exist, a primitive might be either impossible to
determine or difficult to compute. For instance, knowing that $\ln|x|$
is a primitive of $1/x$ is irrelevant if one doesn't know how to compute
logarithms efficiently. In Chapter 4 we will introduce several methods
to compute the integral of arbitrary continuous functions with a desired
accuracy, irrespective of the knowledge of their primitives.

We recall that a function $f$ defined on an interval $[a, b]$ is
differentiable at a point $\bar{x} \in (a, b)$ if the following limit
exists and is finite:

$$f'(\bar{x}) = \lim_{h \to 0} \frac{1}{h}\,(f(\bar{x} + h) - f(\bar{x})). \qquad (1.9)$$

If $\bar{x} = a$ this definition requires $h$ to be positive, whereas $h$
must be negative if $\bar{x} = b$. In all cases, the value of the
derivative provides the slope of the tangent line to the graph of $f$ at
the point $\bar{x}$. We say that a function which is continuous together
with its derivative at any point of $[a, b]$ belongs to the space
$C^1([a, b])$. More generally, a function with continuous derivatives up
to the order $p$ (a positive integer) is said to belong to
$C^p([a, b])$. In particular, $C^0([a, b])$ denotes the space of
continuous functions in $[a, b]$.

A result that will often be used is the mean value theorem, according
to which, if $f \in C^1([a, b])$, there exists $\xi \in (a, b)$ such that

$$f'(\xi) = (f(b) - f(a))/(b - a).$$

Finally, it is worth recalling that a function continuous with all its
derivatives up to the order $n + 1$ in a neighborhood of $x_0$ can be
approximated in such a neighborhood by the so-called Taylor polynomial
of degree $n$ at the point $x_0$:

$$T_n(x) = f(x_0) + (x - x_0) f'(x_0) + \ldots + \frac{1}{n!}(x - x_0)^n f^{(n)}(x_0) = \sum_{k=0}^{n} \frac{(x - x_0)^k}{k!} f^{(k)}(x_0).$$
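
A numerical illustration of the Taylor polynomial, added here as a
sketch: for $f(x) = e^x$ and $x_0 = 0$ (choices made only for this
example) all derivatives at $x_0$ equal $e^{x_0}$, so the coefficients
are available in closed form and the polynomial can be evaluated with
polyval.

```matlab
n  = 5;                          % degree of the Taylor polynomial
x0 = 0;                          % expansion point
k  = 0:n;

% For f(x) = exp(x) every derivative at x0 equals exp(x0)
c = exp(x0) ./ factorial(k);     % c(k+1) = f^(k)(x0)/k!

% polyval expects the coefficients in decreasing powers of (x - x0)
T = @(x) polyval(fliplr(c), x - x0);

x   = [0.1 0.5 1];
err = abs(exp(x) - T(x));        % error grows with the distance from x0
```

The approximation is local: for fixed n the error increases as x moves
away from $x_0$.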

The MATLAB toolbox symbolic provides the commands diff, int and taylor
which allow us to obtain the analytical expression of the derivative,
the indefinite integral (i.e. a primitive) and the Taylor polynomial,
respectively, of a given function. In particular, having defined in the
string f the function on which we intend to operate, diff(f,n) provides
its derivative of order n, int(f) its indefinite integral, and
taylor(f,x,n+1) the associated Taylor polynomial of degree n in a
neighborhood of $x_0 = 0$. The variable x must be declared symbolic by
using the command syms x. This will allow its algebraic manipulation
without specifying its value.

In order to do this for the function $f(x) = (x^2 + 2x + 2)/(x^2 - 1)$,
we proceed as follows:

>> f = '(x^2+2*x+2)/(x^2-1)';
>> syms x
>> diff(f)
(2*x+2)/(x^2-1)-2*(x^2+2*x+2)/(x^2-1)^2*x
>> int(f)
x+5/2*log(x-1)-1/2*log(1+x)
>> taylor(f,x,6)
-2-2*x-3*x^2-2*x^3-3*x^4-2*x^5



The command funtool, by the graphical interface illustrated in Fig. 1.5,
allows a very easy symbolic manipulation of arbitrary functions.

Fig. 1.5. Graphical interface of the command funtool

See Exercises 1.6-1.7.




1.5 To err is not only human 



As a matter of fact, by re-phrasing the Latin motto Errare humanum est,
we might say that in numerical computation to err is even inevitable.
As we have seen, the simple fact of using a computer to represent real
numbers introduces errors. What is therefore important is not to strive
to eliminate errors, but rather to be able to control their effect.

Generally speaking, we can identify several levels of errors that oc- 
cur during the approximation and resolution of a physical problem (see 
Figure 1.6). 






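[Figure: diagram linking the physical problem PP, the mathematical model
MP and the numerical problem NP through the errors $e_m$, $e_c$, $e_t$
and $e_a$]

Fig. 1.6. Types of errors in a computational process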



At the highest level stands the error $e_m$ which occurs when forcing
the physical reality (PP stands for physical problem and $x_{ph}$
denotes its solution) to obey some mathematical model (MP, whose
solution is $x$). Such errors will limit the applicability of the
mathematical model to certain situations and are beyond the control of
Scientific Computing.

The mathematical model (whether expressed by an integral as in the
example of Figure 1.6, an algebraic or differential equation, a linear
or nonlinear system) is generally not solvable in explicit form. Its
resolution by computer algorithms will surely involve the introduction
and propagation of roundoff errors at least. Let us call these errors
$e_a$.

On the other hand, it is often necessary to introduce further errors,
since any procedure of the mathematical model involving an infinite
sequence of arithmetic operations can be performed by the computer only
approximately. For instance, the computation of the sum of a series will
necessarily be accomplished in an approximate way by considering a
suitable truncation.
It will therefore be necessary to introduce a numerical problem, NP,
whose solution $x_n$ differs from $x$ by an error $e_t$ which is called
truncation error. Such errors are absent only in mathematical models
that are already set in finite dimension (for instance, when solving a
linear system). The sum of the errors $e_a$ and $e_t$ constitutes the
computational error $e_c$, the quantity we are interested in.

The absolute computational error is the difference between $x$, the
exact solution of the mathematical model, and $\hat{x}$, the solution
obtained at the end of the numerical process,

$$e_c^{abs} = |x - \hat{x}|,$$

while (if $x \neq 0$) the relative computational error is

$$e_c^{rel} = |x - \hat{x}|/|x|,$$

where $|\cdot|$ denotes the modulus, or other measure of size, depending
on the meaning of $x$.
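
To make these definitions concrete, the following sketch (an added
illustration, with the series for the number $e$ chosen here as test
case) truncates $\sum_{k \geq 0} 1/k!$ after the term of index N and
measures the absolute and relative errors with respect to exp(1), which
plays the role of the exact solution up to roundoff.

```matlab
x = exp(1);                       % "exact" value (up to roundoff)

N = [2 4 8 12];                   % highest index kept in the series
e_abs = zeros(size(N));
e_rel = zeros(size(N));
for i = 1:numel(N)
    k = 0:N(i);
    x_hat    = sum(1 ./ factorial(k));   % truncated series sum_{k=0}^{N} 1/k!
    e_abs(i) = abs(x - x_hat);
    e_rel(i) = abs(x - x_hat) / abs(x);
end

disp([N; e_abs; e_rel]');         % one row per value of N
```

For these values of N the truncation error dominates the roundoff error,
so the computed errors decrease as more terms are retained.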

The numerical process is generally an approximation of the mathematical
model obtained as a function of a discretization parameter, which we
will refer to as $h$ and suppose positive. If, as $h$ tends to 0, the
numerical process returns the solution of the mathematical model, we
will say that the numerical process is convergent. Moreover, if the
(absolute or relative) error can be expressed as a function of $h$ as

$$e_c = C h^p, \qquad (1.10)$$

where $C$ is independent of $h$ and $p$ is a positive number, we will
say that the method is convergent of order $p$.

Example 1.1 Suppose we approximate the derivative of a function $f$ at a
point $\bar{x}$ with the incremental ratio that appears in (1.9).
Obviously, if $f$ is differentiable at $\bar{x}$, the error committed by
replacing $f'$ by the incremental ratio tends to 0 as $h \to 0$.
However, as we will see in Section 4.1, the error can be considered as
$Ch$ only if $f \in C^2$ in a neighborhood of $\bar{x}$.

While studying the convergence properties of a numerical procedure we
will often deal with graphs reporting the error as a function of $h$ in
a logarithmic scale, which shows $\log(h)$ on the abscissae axis and
$\log(e_c)$ on the ordinates axis. The purpose of this representation is
easy to see: if $e_c = C h^p$ then $\log e_c = \log C + p \log h$. In
logarithmic scale, $p$ therefore represents the slope of the straight
line $\log e_c$, so if we must compare two methods, the one presenting
the greater slope will be the one with the higher order. In MATLAB it is
very simple to obtain graphs in a logarithmic scale: one just needs to
type loglog(x,y), x and y being the vectors containing the abscissae and
the ordinates of the data to be represented.
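
The slope can also be read off numerically by fitting a straight line to
the logarithms of the data. The sketch below is an added illustration:
it uses $f(x) = e^x$ at $x = 1$ as test case and compares the incremental
ratio of (1.9) with a centered ratio; the centered formula is an
assumption introduced here for comparison and is not discussed in this
section.

```matlab
f  = @(x) exp(x);  x = 1;  exact = exp(1);   % f' = f for the exponential
h  = 2.^(-(2:8));                            % decreasing step sizes

e1 = abs((f(x + h) - f(x)) ./ h - exact);            % ratio (1.9)
e2 = abs((f(x + h) - f(x - h)) ./ (2*h) - exact);    % centered ratio (assumed here)

% Slope of log(e) versus log(h): first entry of the degree-1 least-squares fit
c1 = polyfit(log(h), log(e1), 1);   slope1 = c1(1);
c2 = polyfit(log(h), log(e2), 1);   slope2 = c2(1);

% To look at the lines themselves: loglog(h, e1, '-', h, e2, '--')
```

With these data slope1 should come out close to 1 and slope2 close to 2,
i.e. the two ratios behave as first-order and second-order
approximations respectively.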

For instance, Figure 1.7 reports the straight lines relative to the
behavior of the errors in two different methods. The continuous line
represents a first-order approximation, while the dashed line represents
a second-order one.




Fig. 1.7. Plot in logarithmic scales 

There is an alternative to the graphical way of establishing the order
of a method when one knows the errors $e_i$ relative to some given
values $h_i$ of the parameter of discretization, with
$i = 1, \ldots, N$: it consists in supposing that $e_i$ is equal to
$C h_i^p$, where $C$ does not depend on $i$. One can then approximate
$p$ with the values:

$$p_i = \log(e_i/e_{i-1}) / \log(h_i/h_{i-1}), \quad i = 2, \ldots, N. \qquad (1.11)$$
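
Formula (1.11) translates into one vectorized instruction. In the
following sketch the test function $\sin x$, the point $x = 1$ and the
step sizes are arbitrary choices made here; the errors are those of the
incremental ratio (1.9), whose exact limit $\cos 1$ is known.

```matlab
f = @(x) sin(x);  x = 1;  exact = cos(1);   % test function with known derivative
h = 0.1 ./ 2.^(0:5);                        % h_1 > h_2 > ... > h_N
e = abs((f(x + h) - f(x)) ./ h - exact);    % errors e_i of the ratio (1.9)

% Formula (1.11): p_i = log(e_i/e_{i-1}) / log(h_i/h_{i-1}), i = 2,...,N
p = log(e(2:end) ./ e(1:end-1)) ./ log(h(2:end) ./ h(1:end-1));
disp(p);
```

The values p_i should approach 1 as h decreases, in agreement with
Example 1.1.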

Actually the error is not a computable quantity since it depends on the
unknown solution. Therefore it is necessary to introduce computable
quantities that can be used to estimate the error itself, the so-called
error estimators. We will see some examples in Sections 2.2, 2.3 and 4.3.



1.5.1 Talking about costs 



In general a problem is solved on the computer by an algorithm, which 
is a precise directive in the form of a finite text specifying the execution 
of a finite series of elementary operations. We are interested in those 
algorithms which involve only a finite number of steps. 
The computational cost of an algorithm is the number of floating-point
operations (in short, ops) that are required for its execution. Often,
the speed of a computer is measured by the maximum number of
floating-point operations which the computer can execute in one second
(flops). In particular, the following abridged notations are commonly
used: Mega-flops, equal to $10^6$ flops, Giga-flops, equal to $10^9$
flops, Tera-flops, equal to $10^{12}$ flops. The fastest computers
nowadays reach as many as 40 Tera-flops. Earlier versions of MATLAB
allowed the count of the number of floating-point operations by means of
the command flops. This is no longer possible in MATLAB 6. However,
throughout this book we will sometimes make use of the command flops; in
those cases it is understood that version 5.3 of MATLAB is being used.

In general, the exact knowledge of the number of operations required by
a given algorithm is not essential. Rather, it is useful to determine
its order of magnitude as a function of a parameter $d$ which is related
to the problem dimension. We therefore say that an algorithm has
constant complexity if it requires a number of operations independent of
$d$, i.e. $O(1)$ operations, linear complexity if it requires $O(d)$
operations, or, more generally, polynomial complexity if it requires
$O(d^m)$ operations, for a positive integer $m$. Other algorithms may
have exponential ($O(c^d)$ operations) or even factorial ($O(d!)$
operations) complexity. We recall that the symbol $O(d^m)$ means "it
behaves, for large $d$, like a constant times $d^m$".

Example 1.2 (matrix-vector product) Let $A$ be a square matrix of order
$n$ and let $v \in \mathbb{R}^n$: the $j$-th component of the product
$Av$ is given by

$$a_{j1}v_1 + a_{j2}v_2 + \ldots + a_{jn}v_n,$$

and requires $n$ products and $n - 1$ additions. One needs therefore
$n(2n - 1)$ ops to compute all the components. Thus this algorithm
requires $O(n^2)$ ops, so it has a quadratic complexity with respect to
the parameter $n$. The same algorithm would require $O(n^3)$ ops to
compute the product of two matrices of order $n$. However, there is an
algorithm, due to Strassen, which requires "only" $O(n^{\log_2 7})$ ops,
and another, due to Winograd and Coppersmith, requiring $O(n^{2.376})$
ops.
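
The count $n(2n - 1)$ of Example 1.2 can be verified by instrumenting the
row-by-row algorithm with an explicit counter. This is a sketch added for
illustration; the counter ops is a plain variable introduced here and
has nothing to do with the MATLAB command flops.

```matlab
n = 6;
A = rand(n, n);  v = rand(n, 1);

b   = zeros(n, 1);
ops = 0;                              % running count of products and additions
for j = 1:n
    s = A(j, 1) * v(1);   ops = ops + 1;          % first product
    for k = 2:n
        s = s + A(j, k) * v(k);   ops = ops + 2;  % one product, one addition
    end
    b(j) = s;
end
```

Each row costs $1 + 2(n - 1) = 2n - 1$ operations, hence $n(2n - 1)$ in
total, and b agrees with A*v up to roundoff.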

Example 1.3 (computation of a matrix determinant) As already mentioned,
the determinant of a square matrix of order $n$ can be computed by the
recursive formula (1.8). The corresponding algorithm has a factorial
complexity with respect to $n$ and would be usable only for matrices of
small dimension. For instance, if $n = 24$, a computer capable of
performing as many as 1 Peta-flops (i.e. $10^{15}$ ops per second) would
require 20 years to carry out this computation. One has therefore to
resort to more efficient algorithms. Indeed, there exists an algorithm
that allows the computation of determinants through matrix-matrix
products, hence with a complexity of $O(n^{\log_2 7})$ ops by resorting
to the Strassen algorithm previously mentioned (see [BB96]).
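
The figure of 20 years can be recovered with a short computation. The
sketch assumes a cost model of about $n!$ operations for the recursive
formula, which is a simplification adopted here only to reproduce the
order of magnitude.

```matlab
n       = 24;
ops     = factorial(n);               % assumed cost model: about n! operations
speed   = 1e15;                       % 1 Peta-flops, i.e. 1e15 ops per second
seconds = ops / speed;
years   = seconds / (365 * 24 * 3600);
fprintf('n = %d: about %.1f years\n', n, years);
```

Under this assumption the estimate is slightly below 20 years, and it
grows by a factor of 25 when n is increased by just one.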







The number of operations is not the sole parameter which matters in the
analysis of an algorithm. Another relevant factor is represented by the
time that is needed to access the computer memory (which depends on the
way the algorithm has been coded). An indicator of the performance of an
algorithm is therefore the CPU time (CPU stands for central processing
unit), which can be obtained using the MATLAB command cputime. The total
elapsed time between the input and output phases can be obtained by the
command etime.



Example 1.4 In order to compute the time needed for a matrix-vector
multiplication we set up the following program:

>> n = 2000; step = 50;
>> A = rand(n,n); v = rand(n); T = [ ]; sizeA = [ ]; count = 1;
>> for k = 1:step:n
     AA = A(1:k,1:k); vv = v(1:k)';
     t = cputime; b = AA*vv; tt = cputime - t;
     T = [T, tt]; sizeA = [sizeA, k]; count = count + 1;
   end

The instruction a:step:b appearing in the for cycle generates all
numbers having the form a+step*k where k is an integer ranging from 0 to
the largest value kmax for which a+step*kmax is not greater than b (in
the case at hand, a=1, b=2000 and step=50). The command rand(n,m)
defines an n×m matrix of random entries. Finally, T is the vector whose
components contain the CPU time needed to carry out every single
matrix-vector product, whereas cputime returns the CPU time in seconds
that has been used by the MATLAB process since MATLAB started. The time
necessary to execute a single program is therefore the difference
between the actual CPU time and the one computed before the execution of
the current program, which is stored in the variable t. Figure 1.8,
which is obtained by the command plot(sizeA,T,'o'), shows that the CPU
time grows like the square of the matrix order n.
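
On fast machines a single product may take less time than the resolution
of cputime. A possible variant, sketched here under the assumption that
the wall-clock timers tic and toc are available and accept a timer
handle, repeats each product several times and averages. The sizes and
the number of repetitions are arbitrary choices.

```matlab
n = 400;  step = 100;                 % small sizes, chosen only to keep the run short
A = rand(n, n);  v = rand(n, 1);
sizes = step:step:n;
T = zeros(size(sizes));

for i = 1:numel(sizes)
    k  = sizes(i);
    AA = A(1:k, 1:k);  vv = v(1:k);
    reps = 50;                        % repeat to get a measurable interval
    t0 = tic;
    for r = 1:reps
        b = AA * vv;
    end
    T(i) = toc(t0) / reps;            % average wall-clock time per product
end
% plot(sizes, T, 'o') would display the measured times
```

Note that this measures elapsed time rather than CPU time, so the
results also depend on whatever else the machine is doing.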



Fig. 1.8. Matrix-vector product: the CPU time (in seconds) versus the
dimension n of the matrix (on a PC at 433 MHz)



1.6 A few more words about MATLAB



MATLAB is an integrated environment for scientific computing and
visualization. It is written in C language and is distributed by The
MathWorks (see the website www.mathworks.com).

The main program is contained in the subdirectory bin of the principal
directory matlab. Once installed, the execution of MATLAB allows access
to a working environment characterized by the prompt >>. For instance,
by executing MATLAB on our personal computer it shows

<MATLAB>
Copyright 1984-2001 The MathWorks, Inc.
Version 6.1.0.450 Release 12.1
May 18 2001

To get started, select "MATLAB Help" from the Help menu.

>>

After pressing the key enter (or else return), all that is written after
the prompt will be interpreted.¹ Precisely, MATLAB will first check
whether what is written corresponds either to variables which have
already been defined or to the name of one of the programs or commands
defined in MATLAB. Should all those checks fail, MATLAB returns an error
message. Otherwise, the command is executed and an output will possibly
be displayed. In all cases, at the end the system returns the prompt to
acknowledge that it is ready for a new command. To close a MATLAB
session one should write the command quit (or else exit) and press the
key enter. From now on it will be understood that to execute a program
or a command one has to press the key enter. Moreover, the terms
program, function or command will be used in an equivalent manner. When
our command coincides with one of the elementary structures
characterizing MATLAB (e.g. a number or a string of characters enclosed
in quotes), it is immediately returned in output in the default variable
ans (abbreviation of answer). Here is an example:

>> 'home'
ans =
home

¹ Thus a MATLAB program does not necessarily have to be compiled as other
languages do, e.g. Fortran or C, although a MATLAB compiler may be
invoked by the command mcc to allow a faster execution.

If now we write a different string (or number), ans will assume this new
value.

We can turn off the automatic display of the output by writing a
semicolon after the string. Thus if we write 'home'; MATLAB will simply
return the prompt (yet assigning the value 'home' to the variable ans).

More generally, the command = allows the assignment of a value (or a
string of characters) to a given variable. For instance, to assign the
string 'Welcome to NYC' to the variable a we can write

>> a='Welcome to NYC';

As we can see, there is no need to declare the type of a variable;
MATLAB will do it automatically and dynamically. For instance, should we
write a=5, the variable a will now contain a number and no longer a
string of characters. This flexibility is not cost-free. If we set a
variable named quit equal to the number 5 we are inhibiting the use of
the MATLAB command quit. We should therefore try to avoid using
variables having the name of MATLAB commands. However, by the command
clear followed by the name of a variable (e.g. quit), it is possible to
cancel this assignment and restore the original meaning of the command
quit.

Using the command save followed by the name fname all existing workspace
variables are saved in the binary MATLAB file fname.mat. These data may
be retrieved with the command load fname.mat. Omitting the filename
causes save (or load) to use the default filename matlab.mat. To save
the variables v1, v2, ..., vn the syntax is:
save fname v1 v2 ... vn.

By the command help one can see the whole family of commands and
pre-defined variables, including the so-called toolboxes which are sets
of specialized commands. Among them let us recall those which define the
elementary functions such as sine (sin(a)), cosine (cos(a)), square root
(sqrt(a)), exponential (exp(a)).

There are special characters that cannot appear in the name of a
variable or in a command, for instance the algebraic operators (+, -,
* and /), the logical operators and (&), or (|), not (~), the relational
operators greater than (>), greater than or equal to (>=), less than (<),
less than or equal to (<=), equal to (==). Finally, a name can never
begin with a digit, a bracket or with any punctuation mark.
1.6.1 MATLAB statements



A special programming language, the MATLAB language, is also avail- 
able enabling the users to write new programs. Although its knowledge 
is not required for understanding how to use the several programs which 
we will introduce throughout this book, it may provide the reader with 
the capability of modifying them as well as producing new ones. 

The MATLAB language features standard statements, such as condi- 
tionals and loops. 

The if-elseif-else conditional has the following general form:

if cond(1)
  statement(1)
elseif cond(2)
  statement(2)
  .
  .
  .
else
  statement(n)
end

where cond(1), cond(2), ... represent MATLAB sets of instructions, with
values 0 or 1 (false or true) and the entire construction allows the
execution of that statement corresponding to the condition taking value
equal to 1. Should all conditions be false, the execution of
statement(n) will take place. This means that if the value of cond(k) is
zero, the control moves on.

For instance, to compute the roots of a quadratic polynomial
$ax^2 + bx + c$ one can use the following instructions (the command
disp(.) simply displays what is written between brackets):

>> if a ~= 0
     sq = sqrt(b*b - 4*a*c);
     x(1) = 0.5*(-b + sq)/a;
     x(2) = 0.5*(-b - sq)/a;
   elseif b ~= 0
     x(1) = -c/b;                                          (1.12)
   elseif c ~= 0
     disp('Impossible equation');
   else
     disp('The given equation is an identity')
   end

Note that MATLAB does not execute the entire construction until the
statement end is typed.

MATLAB allows two types of loops, a for-loop (comparable to a Fortran
do-loop or a C for-loop) and a while-loop. A for-loop repeats the
statements in the loop as the loop index takes on the values in a given
row vector. For instance, to compute the first 6 terms of the Fibonacci
sequence $\{f_i = f_{i-1} + f_{i-2}\}$ with $f_1 = 0$ and $f_2 = 1$, one
can use the following instructions:

>> f(1) = 0; f(2) = 1;
>> for i = [3 4 5 6]
     f(i) = f(i-1) + f(i-2);
   end

Note that a semicolon can be used to separate several MATLAB
instructions typed on the same line. Also, note that we can replace the
second instruction by the equivalent >> for i = 3:6. The while-loop
repeats as long as the given cond is true. For instance, the following
set of instructions can be used as an alternative to the previous set:

>> f(1) = 0; f(2) = 1; k = 3;
>> while k <= 6
     f(k) = f(k-1) + f(k-2); k = k + 1;
   end

Other statements of perhaps less frequent use exist, such as switch,
case, otherwise. The interested reader can have access to their meaning
by the help command.
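
As a small illustration of switch, case and otherwise, the following
sketch (not taken from the original text) classifies the roots of
$ax^2 + bx + c$ according to the sign of the discriminant. The sample
coefficients and the variable kind are arbitrary choices.

```matlab
% Classify a quadratic polynomial a*x^2 + b*x + c by the sign of its discriminant
a = 1;  b = 1;  c = 1;                 % sample coefficients
d = b*b - 4*a*c;

switch sign(d)
    case 1
        kind = 'two distinct real roots';
    case 0
        kind = 'one double real root';
    otherwise                          % sign(d) == -1
        kind = 'two complex conjugate roots';
end
disp(kind);
```

Only the first matching case is executed; otherwise plays the role of
the final else in an if-elseif-else construction.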



1.6.2 Programming in MATLAB 



Let us now explain briefly how to write MATLAB programs. As previously
noticed, new programs may be added to MATLAB. A new program must be put
in a file with a given name with extension m, which is called an M-file.
M-files must be located in one of the directories in which MATLAB
automatically searches for them; their list can be obtained by the
command path (see help path to learn how to add a directory to this
list). The first directory scanned by MATLAB is the current working
directory.

It is important at this level to distinguish between scripts and
functions. A script is simply a collection of MATLAB commands in an
m-file and can be used interactively. For instance, the set of
instructions (1.12) can give rise to a script (which we could name
equation) by copying it in the file equation.m. To launch it, one can
simply write after the MATLAB prompt >> the instruction equation. We
report two examples below:

>> a = 1; b = 1; c = 1; 

>> equation 
ans = 

-0.5000 + 0.8660i -0.5000 - 0.8660i 

>> a = 0; b = 1; c = 1;

>> equation 
ans = 

-1 

Much more flexible than scripts are functions. A function is, in
general, defined in an m-file (which we will generically call name.m)
beginning with a line of the following form:

function [out1,...,outn]=name(in1,...,inm)

where out1,...,outn are the output variables and in1,...,inm are the
input variables.

The following file, called det23.m, defines a new function called det23
which computes, according to the formulae given in Section 1.3, the
determinant of a matrix whose dimension could be either 2 or 3:

function det=det23(A)
%DET23 computes the determinant of a square matrix
%  of dimension 2 or 3
[n,m]=size(A);
if n==m
  if n==2
    det = A(1,1)*A(2,2)-A(2,1)*A(1,2);
  elseif n == 3
    det = A(1,1)*det23(A([2,3],[2,3]))-A(1,2)*det23(A([2,3],[1,3]))+...
          A(1,3)*det23(A([2,3],[1,2]));
  else
    disp(' Only 2x2 or 3x3 matrices ');
  end
else
  disp(' Only square matrices ');
end
return

Note the use of the continuation characters ... meaning that the
instruction is continuing on the next line. The instruction
A([i,j],[k,l]) allows the construction of a 2 × 2 matrix whose elements
are the elements of the original matrix A lying at the intersections of
the i-th and j-th rows with the k-th and l-th columns.

When a function is invoked, MATLAB creates a local workspace. The 
commands in the function cannot refer to variables from the global (in- 
teractive) workspace unless they are passed as inputs. In particular, vari- 
ables used in a function are erased when the execution terminates, unless 
they are returned as output parameters. The symbol % is used to begin 
comments. 

Usually functions terminate when the end of the function is reached;
however, a return statement can be used to force an early return
(according to the fulfillment of a certain condition). For instance, in
order to approximate the golden section number
$\alpha = 1.6180339887\ldots$, which is the limit for $k \to \infty$ of
the quotient $f_k/f_{k-1}$, by iterating until the difference between
two consecutive ratios is less than $10^{-4}$, we can construct the
following function:

function [golden,k]=fibonacci
f(1) = 0; f(2) = 1; goldenold = 0; kmax = 100; tol = 1.e-04;
for k = 3:kmax
  f(k) = f(k-1) + f(k-2);
  golden = f(k)/f(k-1);
  if abs(golden - goldenold) <= tol
    return
  end
  goldenold = golden;
end
return

Then, we can write 

>> [alpha, niter]=fibonacci 
alpha = 

1.61805555555556 
niter = 

14 

After 14 iterations the function has returned an approximate value which
shares with $\alpha$ the first 5 significant digits.
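
The result can be cross-checked against the closed form
$\alpha = (1 + \sqrt{5})/2$ of the golden section number. The sketch
below repeats the same iteration in script form, so that it can be pasted
at the prompt; since a script has no caller to return to, break is used
here in place of return.

```matlab
alpha = (1 + sqrt(5)) / 2;             % closed form of the golden section number

tol = 1.e-04;  kmax = 100;
f = zeros(1, kmax);  f(2) = 1;         % f(1) = 0, f(2) = 1
goldenold = 0;
for k = 3:kmax
    f(k)   = f(k-1) + f(k-2);
    golden = f(k) / f(k-1);
    if abs(golden - goldenold) <= tol
        break                          % in a script, break plays the role of return
    end
    goldenold = golden;
end
fprintf('k = %d, golden = %.14f, error = %.2e\n', k, golden, abs(golden - alpha));
```

Note that the stopping test measures the difference between two
consecutive ratios, not the distance from $\alpha$ itself; here the two
quantities happen to be of comparable size.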

The number of input and output parameters of a MATLAB function can vary.
For instance, we could modify the Fibonacci function as follows:

function [golden,k]=fibonacci(tol,kmax)
if nargin == 0
  kmax = 100; tol = 1.e-04; % default values
elseif nargin == 1
  kmax = 100; % default value only for kmax
end
f(1) = 0; f(2) = 1; goldenold = 0;
for k = 3:kmax
  f(k) = f(k-1) + f(k-2);
  golden = f(k)/f(k-1);
  if abs(golden - goldenold) <= tol
    return
  end
  goldenold = golden;
end
return

The function nargin counts the number of input parameters. In the new
version of the function fibonacci we can prescribe the maximum number of
inner iterations allowed (kmax) and a specific tolerance tol. When this
information is missing the function must provide default values (in our
case, kmax = 100 and tol = 1.e-04). A possible use of it is as follows:

>> [alpha,niter]=fibonacci(1.e-6,200)
alpha =
   1.61803381340013
niter =
    19

Note that using a stricter tolerance we have obtained a new approximate
value that shares with $\alpha$ as many as 7 significant digits.

The function nargin can be used externally to a given function to obtain
the number of input parameters. Here is an example:

>> nargin('fibonacci')
ans =
     2

Remark 1.4 (inline functions) The command inline, whose simplest syntax
reads g=inline(expr,arg1,arg2,...,argn), declares a function g which
depends on the strings arg1,arg2,...,argn. The string expr contains the
expression of g. For instance, g=inline('sin(r)','r') declares the
function g(r) = sin(r). The shorthand command g=inline(expr) implicitly
assumes that expr is a function of the default variable x. Once an
inline function has been declared, it can be evaluated at any set of
variables through the command feval. For instance, to evaluate g at the
points z=[0 1] we can write

>> feval(g,z);

We note that, contrary to the case of the command eval, with feval the
name of the variable (z) need not coincide with the symbolic name (r)
assigned by the inline command.

After this quick introduction, our suggestion is to explore MATLAB
using the command help, and get acquainted with the implementation
of various algorithms by the programs described throughout this book.

See the Exercises 1.8-1.11. 



1.7 What we haven't told you 



A systematic discussion on floating-point numbers can be found in [Ueb97]
or in [QSS00].

For matters concerning the issue of complexity, we refer, e.g., to [Pan92].

For a more systematic introduction to MATLAB the interested reader
can refer to the MATLAB manual [HH00] as well as to specific books
such as [HLR01] or [EKH02].



1.8 Exercises 



Exercise 1.1 How many numbers belong to the set $\mathbb{F}(2, 2, -2, 2)$? What is
the value of $\epsilon_M$ for such set?

Exercise 1.2 Show that the set $\mathbb{F}(\beta, t, L, U)$ contains precisely
$2(\beta - 1)\beta^{t-1}(U - L + 1)$ elements.



Exercise 1.3 Write the MATLAB instructions to build an upper (respectively,
lower) triangular matrix of dimension 10 having 2 on the main diagonal
and -3 on the upper (respectively, lower) diagonal.



Exercise 1.4 Write the MATLAB instructions which allow the interchange 
of the third and seventh row of the matrices built up in Exercise 1.3, and 
then the instructions allowing the interchange between the fourth and eighth 
column. 

Exercise 1.5 Verify whether the following vectors in $\mathbb{R}^4$ are linearly
independent:

$v_1 = [0\ 1\ 0\ 1], \quad v_2 = [1\ 2\ 3\ 4], \quad v_3 = [1\ 0\ 1\ 0], \quad v_4 = [0\ 0\ 1\ 1].$

Exercise 1.6 Write the following functions and compute their first and second
derivatives, as well as their primitives, using the symbolic toolbox of MATLAB:

$g(x) = \sqrt{x^2 + 1}, \qquad f(x) = \sin(x^3) + \cosh(x).$

Exercise 1.7 For any given vector v of dimension n, using the command
c=poly(v) one can construct the n + 1 coefficients of the polynomial
$p(x) = \sum_{k=1}^{n+1} c(k)\, x^{n+1-k}$, which is equal to $\prod_{k=1}^{n} (x - v(k))$.
In exact arithmetic, one should find that v = roots(poly(v)). However, this
cannot occur due to roundoff errors, as one can check by using the command
roots(poly([1:n])), where n ranges from 2 to 25.

Exercise 1.8 Write a program to compute the following sequence:

$$I_0 = \frac{1}{e}(e - 1), \qquad I_{n+1} = 1 - (n + 1) I_n, \quad \text{for } n = 0, 1, \ldots$$

Compare the numerical result with the exact limit $I_n \to 0$ for $n \to \infty$.

Exercise 1.9 Explain the behavior of the sequence (1.4) when computed in
MATLAB.

Exercise 1.10 Consider the following algorithm to compute $\pi$. Generate n
couples $\{(x_k, y_k)\}$ of random numbers in the interval [0, 1], then compute the
number m of them lying inside the first quarter of the unit circle. Obviously,
$\pi$ turns out to be the limit of the sequence $\pi_n = 4m/n$. Write a MATLAB
program to compute this sequence and check the error for increasing values of
n.


Exercise 1.11 Write a program for the computation of the binomial coefficient
$\binom{n}{k} = n!/(k!(n - k)!)$, where n and k are two natural numbers with
$k \le n$.




2. Nonlinear equations 



Computing the zeros of a function $f$ (equivalently, the roots of the equation
$f(x) = 0$) is a problem that we encounter quite often in scientific
computing. In general, this task cannot be accomplished in a finite number
of operations. For instance, we have already seen in Section 1.4.1 that
when $f$ is a generic polynomial of degree greater than 4, there do not
exist closed formulae for the zeros. The situation is even more difficult
when $f$ is not a polynomial.

Iterative methods are therefore adopted. Starting from one or several
initial data, the methods build up a sequence of values $x^{(k)}$ that hopefully
will converge to a zero $\alpha$ of the function $f$ at hand.

Problem 2.1 (Investment fund) At the beginning of every year a bank
customer deposits v euros in an investment fund and withdraws, at the end of
the n-th year, a capital of M euros. We want to compute the average yearly
rate of interest I of this investment. Since M is related to I by the relation

$$M = v \sum_{k=1}^{n} (1 + I)^k = v\,\frac{1 + I}{I}\left[(1 + I)^n - 1\right],$$

we deduce that I is the root of the algebraic equation:

$$f(I) = 0 \quad \text{where} \quad f(I) = M - v\,\frac{1 + I}{I}\left[(1 + I)^n - 1\right].$$

This problem will be solved in Example 2.1. •

Problem 2.2 (State equation of a gas) We want to determine the volume
V occupied by a gas at temperature T and pressure p. The state equation
(i.e. the equation that relates p, V and T) is

$$\left[p + a(N/V)^2\right](V - Nb) = kNT, \qquad (2.1)$$

where a and b are two coefficients that depend on the specific gas, N is the
number of molecules which are contained in the volume V and k is the
Boltzmann constant. We need therefore to solve a nonlinear equation whose root is
V (see Exercise 2.2). •
Problem 2.3 (Statics) Let us consider the mechanical system represented
by the four rigid rods $\mathbf{a}_i$ of Figure 2.1. For any admissible value of the angle
$\beta$, let us determine the value of the corresponding angle $\alpha$ between the rods
$\mathbf{a}_1$ and $\mathbf{a}_2$. Starting from the vector identity

$$\mathbf{a}_1 - \mathbf{a}_2 - \mathbf{a}_3 - \mathbf{a}_4 = \mathbf{0}$$

and noting that the rod $\mathbf{a}_1$ is always aligned with the x-axis, we can deduce
the following relationship between $\beta$ and $\alpha$:

$$\frac{a_1}{a_2}\cos(\beta) - \frac{a_1}{a_4}\cos(\alpha) - \cos(\beta - \alpha) = -\frac{a_1^2 + a_2^2 - a_3^2 + a_4^2}{2 a_2 a_4}, \qquad (2.2)$$

where $a_i$ is the known length of the i-th rod. This is called the Freudenstein
equation, and we can rewrite it as follows: $f(\alpha) = 0$, where
$f(x) = (a_1/a_2)\cos(\beta) - (a_1/a_4)\cos(x) - \cos(\beta - x) + (a_1^2 + a_2^2 - a_3^2 + a_4^2)/(2 a_2 a_4)$.
A solution in explicit form is available only for special values of $\beta$. We would also
like to mention that a solution does not exist for all values of $\beta$, and may not
even be unique. To solve the equation for any given $\beta$ lying between 0 and $\pi$
we should invoke numerical methods (see Exercise 2.9). •




2.1 The bisection method 



Let $f$ be a continuous function in $[a, b]$ which satisfies $f(a)f(b) < 0$.
Then necessarily $f$ has at least one zero in $(a, b)$. Assume for simplicity
that it is unique, and let us call it $\alpha$.

(In the case of several zeros, by the help of the command fplot we
can locate an interval which contains only one of them.)

The strategy of the bisection method is to halve the given interval and
select that subinterval where $f$ features a sign change (this subinterval
will henceforth contain $\alpha$). This procedure is iterated until the last selected
interval is so small that we have reached the desired accuracy. The
sequence $\{x^{(k)}\}$ of the midpoints of these subintervals $I^{(k)}$ will inevitably
tend to $\alpha$ since the length of the subintervals tends to zero as k tends to
infinity.




Fig. 2.2. A few iterations of the bisection method 

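Precisely, the method is started by setting

$$a^{(0)} = a, \quad b^{(0)} = b, \quad I^{(0)} = (a^{(0)}, b^{(0)}), \quad x^{(0)} = (a^{(0)} + b^{(0)})/2.$$

At each step $k \ge 1$ we select the subinterval $I^{(k)} = (a^{(k)}, b^{(k)})$ of the
interval $I^{(k-1)} = (a^{(k-1)}, b^{(k-1)})$ as follows:

given $x^{(k-1)} = (a^{(k-1)} + b^{(k-1)})/2$, if $f(x^{(k-1)}) = 0$ then $\alpha = x^{(k-1)}$
and the method terminates; otherwise,

if $f(a^{(k-1)}) f(x^{(k-1)}) < 0$ we set $a^{(k)} = a^{(k-1)}$, $b^{(k)} = x^{(k-1)}$;

if $f(x^{(k-1)}) f(b^{(k-1)}) < 0$ we set $a^{(k)} = x^{(k-1)}$, $b^{(k)} = b^{(k-1)}$.

Then we define $x^{(k)} = (a^{(k)} + b^{(k)})/2$ and k is increased by 1.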

For instance, in the case represented in Figure 2.2, which corresponds
to the choice $f(x) = x^2 - 1$, by taking $a^{(0)} = -0.25$ and $b^{(0)} = 1.25$, we
would obtain

$I^{(0)} = (-0.25, 1.25)$, $x^{(0)} = 0.5$,

$I^{(1)} = (0.5, 1.25)$, $x^{(1)} = 0.875$,

$I^{(2)} = (0.875, 1.25)$, $x^{(2)} = 1.0625$,

$I^{(3)} = (0.875, 1.0625)$, $x^{(3)} = 0.96875$.
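The intervals listed above can be reproduced with a few lines of MATLAB. What follows is only a minimal sketch, not Program 1: the variable names (xmid, nsteps) are our own choice, and the loop assumes that no midpoint is an exact zero of the function.

```matlab
% Sketch: trace the first bisection steps for f(x) = x^2 - 1 on (-0.25, 1.25).
f = @(x) x.^2 - 1;            % function of the worked example
a = -0.25; b = 1.25;          % end-points of the initial interval I^(0)
nsteps = 3;                   % number of halvings to trace (illustrative)
xmid = zeros(1, nsteps + 1);  % midpoints x^(0), ..., x^(3)
xmid(1) = (a + b)/2;
for k = 1:nsteps
    if f(a)*f(xmid(k)) < 0
        b = xmid(k);          % sign change in the left half
    else
        a = xmid(k);          % sign change in the right half (f(xmid) ~= 0 assumed)
    end
    xmid(k+1) = (a + b)/2;
    fprintf('I(%d) = (%g, %g),  x(%d) = %g\n', k, a, b, k, xmid(k+1));
end
```

The vector xmid then holds the midpoints $x^{(0)}, \ldots, x^{(3)}$ in its four entries.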
Notice that every subinterval $I^{(k)}$ contains the zero $\alpha$. Moreover, the
sequence $\{x^{(k)}\}$ necessarily converges to $\alpha$ since at each step the length
$|I^{(k)}| = b^{(k)} - a^{(k)}$ of $I^{(k)}$ halves. Since $|I^{(k)}| = (1/2)^k |I^{(0)}|$, the error at
the k-th step satisfies

$$|e^{(k)}| = |x^{(k)} - \alpha| < \frac{1}{2}|I^{(k)}| = \left(\frac{1}{2}\right)^{k+1}(b - a).$$

In order to guarantee that $|e^{(k)}| < \varepsilon$, for a given tolerance $\varepsilon$ it suffices to
carry out $k_{min}$ iterations, $k_{min}$ being the least integer that satisfies the
inequality

$$k_{min} > \log_2\left(\frac{b - a}{\varepsilon}\right) - 1. \qquad (2.3)$$

Obviously, this inequality makes sense in general, and is not confined to
the specific choice of $f$ that we have made before.
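As a small numerical illustration of (2.3), the least admissible number of iterations can be computed directly. This is a sketch under our own assumptions: the interval and tolerance below are illustrative values, and the names y, kmin and bound are not taken from Program 1.

```matlab
% Sketch: least number of bisection steps predicted by inequality (2.3).
a = 0.01; b = 0.1;              % illustrative search interval
tol = 1e-12;                    % illustrative tolerance
y = log2((b - a)/tol) - 1;      % right-hand side of (2.3)
kmin = floor(y) + 1;            % least integer strictly greater than y
bound = (b - a)/2^(kmin + 1);   % error bound guaranteed after kmin steps
fprintf('kmin = %d, error bound = %.3e\n', kmin, bound);
```

Using floor(y)+1 rather than ceil(y) keeps the inequality strict even when y happens to be an integer.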



The bisection method is implemented in Program 1: fun is a string
(or an inline function) that specifies the function $f$, a and b are the
extremes of the search interval, tol is the tolerance $\varepsilon$ and nmax the
maximum number of iterations allotted. When fun is an inline function
(or a function defined in an m-file), besides the first argument relative
to the independent variable it can accept other auxiliary parameters
needed for the definition of fun.

Output parameters are zero, which contains the approximate value
of $\alpha$, the residual res which is the value of $f$ in zero and niter which
is the total number of iterations that are carried out. The command
find(fx==0) finds those indices of the vector fx corresponding to null
components, while the command sign(fx) determines the sign of fx.

Program 1 - bisection: bisection method

function [zero,res,niter]=bisection(fun,a,b,tol,nmax,varargin)
%BISECTION Find function zeros.
% ZERO=BISECTION(FUN,A,B,TOL,NMAX) tries to find a zero ZERO of the
% continuous function FUN in the interval [A,B] using the bisection method.
% FUN accepts real scalar input x and returns a real scalar value. If the search
% fails an error message is displayed. FUN can also be an inline object.
% ZERO=BISECTION(FUN,A,B,TOL,NMAX,P1,P2,...) passes parameters P1,
% P2,... to the function FUN(X,P1,P2,...).
% [ZERO,RES,NITER]=BISECTION(FUN,...) returns the value of the
% residual in ZERO and the iteration number at which ZERO was computed.
x = [a, (a+b)*0.5, b];
fx = feval(fun,x,varargin{:});
if fx(1)*fx(3)>0
    error('The sign of the function at the extrema of the interval must be different');
elseif fx(1) == 0
    zero = a; res = 0; niter = 0;
    return
elseif fx(3) == 0
    zero = b; res = 0; niter = 0;
    return
end
niter = 0;
I = (b - a)*0.5;
while I >= tol & niter <= nmax
    niter = niter + 1;
    if sign(fx(1))*sign(fx(2)) < 0
        x(3) = x(2); x(2) = x(1)+(x(3)-x(1))*0.5;
        fx = feval(fun,x,varargin{:}); I = (x(3)-x(1))*0.5;
    elseif sign(fx(2))*sign(fx(3)) < 0
        x(1) = x(2); x(2) = x(1)+(x(3)-x(1))*0.5;
        fx = feval(fun,x,varargin{:}); I = (x(3)-x(1))*0.5;
    else
        x(2) = x(find(fx==0)); I = 0;
    end
end
if niter > nmax
    fprintf(['bisection stopped without converging to the desired tolerance ',...
        'because the maximum number of iterations was reached\n']);
end
% the residual is evaluated with the same auxiliary parameters as above
zero = x(2); x = x(2); res = feval(fun,x,varargin{:});



Example 2.1 Let us apply the bisection method to solve Problem 2.1, assuming
that v is equal to 1000 euros and that after 5 years M is equal to
6000 euros. The graph of the function $f$ can be obtained by the following
instructions

>> f=inline('M-v*(1+I).*((1+I).^5 - 1)./I','I','M','v');
>> fplot(f,[0.01,0.3],[],[],[],6000,1000)

We see that $f$ has a unique zero in the interval (0.01, 0.1), which is about
equal to 0.06. If we execute Program 1 with tol = $10^{-12}$, a = 0.01 and
b = 0.1, after 36 iterations the method converges to the value 0.061402, in
perfect agreement with the estimate (2.3) according to which $k_{min} = 36$. We
thus conclude that the interest rate I is approximately equal to 6.14%.

In spite of its simplicity, the bisection method does not guarantee a
monotone reduction of the error, but simply that the search interval is
halved from one iteration to the next. Consequently, if the only stopping
criterion that is adopted is the control of the length of $I^{(k)}$, one might
discard approximations of $\alpha$ which are quite accurate.

As a matter of fact, this method does not take into proper account
the actual behavior of $f$. A striking fact is that it does not converge in
a single iteration even if $f$ is a linear function (unless the zero $\alpha$ is the
midpoint of the initial search interval).

See Exercises 2.1-2.5.



2.2 The Newton method 



A more efficient method for the computation of the zeros of a function $f$
can be constructed by exploiting the differentiability of $f$. In that case,

$$y(x) = f(x^{(k)}) + f'(x^{(k)})(x - x^{(k)})$$

provides the equation of the tangent to the curve $(x, f(x))$ at the point
$x^{(k)}$.

If we require $x^{(k+1)}$ to be such that $y(x^{(k+1)}) = 0$, we obtain:

$$x^{(k+1)} = x^{(k)} - \frac{f(x^{(k)})}{f'(x^{(k)})}, \quad k \ge 0, \quad \text{provided } f'(x^{(k)}) \ne 0. \qquad (2.4)$$



This formula allows us to compute a sequence of values $x^{(k)}$ starting
from an initial guess $x^{(0)}$. This method is known as Newton's method
and corresponds to computing the zero of $f$ by locally replacing $f$ by its
tangent line (see Figure 2.3).

As a matter of fact, by developing $f$ in Taylor series in a neighborhood
of a generic point $x^{(k)}$ we find that

$$f(x^{(k+1)}) = f(x^{(k)}) + \delta^{(k)} f'(x^{(k)}) + O((\delta^{(k)})^2), \qquad (2.5)$$

where $\delta^{(k)} = x^{(k+1)} - x^{(k)}$. Forcing $f(x^{(k+1)})$ to be zero and neglecting
the term $O((\delta^{(k)})^2)$, we can obtain $x^{(k+1)}$ as a function of $x^{(k)}$ as stated
in (2.4). In this respect (2.4) can be regarded as an approximation of
(2.5).

Obviously, (2.4) converges in a single step when $f$ is linear, that is
$f(x) = a_1 x + a_0$; in such a case

$$x^{(1)} = x^{(0)} - \frac{a_1 x^{(0)} + a_0}{a_1} = -\frac{a_0}{a_1}.$$
Fig. 2.3. The first iterations generated by the Newton method with initial
guess $x^{(0)}$ for the function $f(x) = x + e^x + 10/(1 + x^2) - 5$

Example 2.2 Let us solve Problem 2.1 by Newton's method, taking as initial
data $x^{(0)} = 0.3$. After 6 iterations the difference between two subsequent
iterates is less than or equal to $10^{-12}$.
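The computation of Example 2.2 can be sketched in a few lines, independently of any program file. The sketch below is ours and rests on two assumptions: the function of Problem 2.1 is written in its equivalent sum form $f(I) = M - v\sum_{k=1}^{n}(1+I)^k$, which makes the derivative easy to write down, and the iteration is stopped on the increment with a tolerance of $10^{-12}$.

```matlab
% Sketch: Newton's method (2.4) for the investment problem, with
% M = 6000, v = 1000, n = 5 and initial guess 0.3.
M = 6000; v = 1000; n = 5;
k = 1:n;
f  = @(I) M - v*sum((1 + I).^k);         % f(I) in sum form
df = @(I) -v*sum(k.*(1 + I).^(k - 1));   % f'(I), differentiated term by term
x = 0.3;                                 % initial guess
tol = 1e-12; nmax = 100;                 % illustrative stopping parameters
niter = 0; incr = tol + 1;
while incr >= tol && niter < nmax
    niter = niter + 1;
    dx = -f(x)/df(x);                    % Newton increment
    x = x + dx;
    incr = abs(dx);                      % difference between two iterates
end
fprintf('I = %.12f after %d iterations, residual %.3e\n', x, niter, f(x));
```

Since $f$ is decreasing and concave for $I > 0$ and the initial guess lies to the right of the zero, the iterates approach the zero monotonically from the right in this particular case.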



The Newton method in general does not converge for all possible
choices of $x^{(0)}$, but only for those values of $x^{(0)}$ which are sufficiently
close to $\alpha$. At a first glance, this request looks meaningless: indeed, in
order to compute $\alpha$ (which is unknown), one should start from a value
sufficiently close to $\alpha$!

In practice, a possible initial value $x^{(0)}$ can be obtained by resorting
to a few iterations of the bisection method or, alternatively, through
an investigation of the graph of $f$. If $x^{(0)}$ is properly chosen and $\alpha$ is
a simple zero (see Section 1.4.1) then the Newton method converges.
Furthermore, in the special case in which $f$ is continuously differentiable
up to its second derivative one has the following convergence result (see
Exercise 2.8),

$$\lim_{k \to \infty} \frac{x^{(k+1)} - \alpha}{(x^{(k)} - \alpha)^2} = \frac{f''(\alpha)}{2 f'(\alpha)}. \qquad (2.6)$$



Consequently, Newton's method is said to converge quadratically, or with
order 2, since for sufficiently large values of k the error at the (k + 1)-th
step behaves like the square of the error at the k-th step multiplied by
a constant which is independent of k.

In the case of zeros with multiplicity m larger than 1, the order of
convergence of Newton's method downgrades to 1 (see Exercise 2.15). In
that case one could recover the order 2 by modifying the original method
as follows:

$$x^{(k+1)} = x^{(k)} - m\,\frac{f(x^{(k)})}{f'(x^{(k)})}, \quad k \ge 0, \quad \text{provided } f'(x^{(k)}) \ne 0. \qquad (2.7)$$

Obviously, this modified Newton method requires the a-priori knowledge
of m. If this is not the case, one could develop an adaptive Newton
method, still of order 2, as described in [QSS00], Section 6.6.2.

Example 2.3 The function $f(x) = (x - 1)\log(x)$ has a single zero $\alpha = 1$ of
multiplicity m = 2. Let us compute it by both Newton's method (2.4) and by
its modified version (2.7). In Figure 2.4 we report the error obtained using the
two methods versus the iteration number. Note that for the classical Newton's
method the convergence is only linear.

Fig. 2.4. Error versus iteration number for the function of Example 2.3.
Dashed line corresponds to Newton's method (2.4), solid line to the modified
Newton's method (2.7) (with m = 2)
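The behavior described in Example 2.3 can be checked with the following minimal sketch. The initial guess $x^{(0)} = 2$ and the number of iterations are our own assumptions (the text does not state the values used for Figure 2.4); only four iterations are performed so that the modified method does not reach the zero exactly, where $f'$ vanishes.

```matlab
% Sketch: Newton (2.4) versus modified Newton (2.7) on f(x) = (x-1)*log(x),
% whose zero alpha = 1 has multiplicity m = 2.
f  = @(x) (x - 1).*log(x);
df = @(x) log(x) + (x - 1)./x;   % first derivative of f
m = 2;                           % multiplicity of the zero
x0 = 2;                          % initial guess (an assumption)
nit = 4;                         % number of iterations (an assumption)
xn = x0; xm = x0;
errN = zeros(1, nit); errM = zeros(1, nit);
for k = 1:nit
    xn = xn - f(xn)/df(xn);      % classical Newton step
    xm = xm - m*f(xm)/df(xm);    % modified Newton step
    errN(k) = abs(xn - 1);
    errM(k) = abs(xm - 1);
end
fprintf('%2d  %10.3e  %10.3e\n', [1:nit; errN; errM]);
```

The errors of the classical method are roughly halved at each step, in agreement with a linear convergence with factor $1 - 1/m = 1/2$, whereas those of the modified method are roughly squared.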

In theory, a convergent Newton's method returns the zero $\alpha$ only after
an infinite number of iterations. In practice, one requires an approximation
of $\alpha$ up to a prescribed tolerance $\varepsilon$. Thus the iterations can be
terminated at the least $k_{min}$ for which the following inequality holds:

$$|e^{(k_{min})}| = |\alpha - x^{(k_{min})}| < \varepsilon.$$

Unfortunately, since the error is unknown, one needs to employ in its
place a suitable error estimator, that is, a quantity that can be easily
computed and through which we can estimate the real error. At the end
of Section 2.3, we will see that a suitable error estimator for Newton's
method is provided by the difference between two successive iterates.
This means that one terminates the iterations at the $k_{min}$-th step as
soon as

$$|x^{(k_{min})} - x^{(k_{min}-1)}| < \varepsilon. \qquad (2.8)$$
Remark 2.1 (The general case) The stopping criterion (2.8) may not be
suitable for methods different from Newton's. Alternatively, one could use an
error estimator based on the residual at the step k defined as $r^{(k)} = f(x^{(k)})$
(note that the residual is null when $x^{(k)}$ is a zero of the function $f$).

We could stop the iteration at the first $k_{min}$ for which

$$|f(x^{(k_{min})})| < \varepsilon.$$

Using the residual as an error estimator is satisfactory only when the behavior
of $f$ is almost linear in a neighborhood $I_\alpha$ of the zero $\alpha$ (see Figure 2.5).
Otherwise, it will produce an overestimation of the error if $|f'(x)| \gg 1$ for
$x \in I_\alpha$ and an underestimation if $|f'(x)| \ll 1$ (see also Exercise 2.6).

Fig. 2.5. Two situations in which the residual is a poor error estimator:
$|f'(x)| \gg 1$ (left), $|f'(x)| \ll 1$ (right), with x belonging to a neighborhood of
$\alpha$



In Program 2 we implement Newton's method (2.4). Its modified form
can be obtained simply by replacing $f'$ with $f'/m$. The input parameters
f and df are the strings which define the function $f$ and its first derivative,
while x0 is the initial guess. The method will be terminated when
the absolute value of the difference between two subsequent iterates is
less than the prescribed tolerance tol, or when the maximum number
of iterations nmax has been reached.



Program 2 - newton: Newton's method

function [zero,res,niter]=newton(f,df,x0,tol,nmax,varargin)
%NEWTON Find function zeros.
% ZERO=NEWTON(FUN,DFUN,X0,TOL,NMAX) tries to find the zero ZERO of
% the continuous and differentiable function FUN nearest to X0 using the Newton
% method. FUN and its derivative DFUN accept real scalar input x and return
% a real scalar value. If the search fails an error message is displayed.
% FUN and DFUN can also be inline objects.
% ZERO=NEWTON(FUN,DFUN,X0,TOL,NMAX,P1,P2,...) passes parameters
% P1,P2,... to functions: FUN(X,P1,P2,...) and DFUN(X,P1,P2,...).
% [ZERO,RES,NITER]=NEWTON(FUN,...) returns the value of the
% residual in ZERO and the iteration number at which ZERO was computed.
x = x0;
fx = feval(f,x,varargin{:});
dfx = feval(df,x,varargin{:});
niter = 0;
diff = tol+1;
while diff >= tol & niter <= nmax
    niter = niter + 1;
    diff = - fx/dfx;
    x = x + diff;
    diff = abs(diff);
    fx = feval(f,x,varargin{:});
    dfx = feval(df,x,varargin{:});
end
if niter > nmax
    fprintf(['newton stopped without converging to the desired tolerance ',...
        'because the maximum number of iterations was reached\n']);
end
zero = x; res = fx;



Let us summarize 



1. Methods for the computation of the zeros of a function $f$ are usually
of iterative type;

2. the bisection method computes a zero of a function $f$ by generating
a sequence of intervals whose length is halved at each iteration.
This method is convergent provided that $f$ is continuous in the
initial interval and has opposite signs at the end-points of this
interval;

3. Newton's method computes a zero $\alpha$ of $f$ by taking into account
the values of $f$ and of its derivative. A necessary condition for convergence
is that the initial datum belongs to a suitable (sufficiently
small) neighborhood of $\alpha$;

4. Newton's method is quadratically convergent only when $\alpha$ is a
simple zero of $f$, otherwise convergence is linear.

See Exercises 2.6-2.14.



2.3 Fixed point iterations 



Playing with a pocket calculator, one may verify that by applying repeatedly
the cosine key to the real value 1, one gets the following sequence
of real numbers:

$x^{(1)} = \cos(1) = 0.54030230586814,$
$x^{(2)} = \cos(x^{(1)}) = 0.85755321584639,$
$\vdots$
$x^{(10)} = \cos(x^{(9)}) = 0.74423735490056,$
$\vdots$
$x^{(20)} = \cos(x^{(19)}) = 0.73918439977149,$

which should tend to the value $\alpha = 0.73908513\ldots$ Since, by construction,
$x^{(k+1)} = \cos(x^{(k)})$ for $k = 0, 1, \ldots$ (with $x^{(0)} = 1$), the limit $\alpha$
satisfies the equation $\cos(\alpha) = \alpha$. For this reason $\alpha$ is called a fixed
point of the cosine function. We may wonder how such iterations could
be exploited in order to compute the zeros of a given function. In the
previous example, $\alpha$ is not only a fixed point for the cosine function, but
also a zero of the function $f(x) = x - \cos(x)$, hence the method previously
proposed can be regarded as a method to compute the zeros of $f$.
On the other hand, not every function has fixed points. For instance, by
repeating the previous experiment using now the exponential function
and still $x^{(0)} = 1$ one encounters a situation of overflow after 4 steps
only (see Figure 2.6).
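The pocket calculator experiment can be repeated in MATLAB with the short sketch below. The variable names (x, y, nexp) are our own; the second loop counts how many applications of the exponential are needed before the result is no longer a finite floating-point number.

```matlab
% Sketch: repeated application of the cosine, starting from 1.
x = zeros(1, 20);
xk = 1;
for k = 1:20
    xk = cos(xk);          % x^(k) = cos(x^(k-1))
    x(k) = xk;
end
fprintf('x(1) = %.14f\nx(2) = %.14f\nx(10) = %.14f\nx(20) = %.14f\n', ...
    x(1), x(2), x(10), x(20));

% Same experiment with the exponential function: overflow occurs quickly.
y = 1; nexp = 0;
while isfinite(y)
    y = exp(y);            % y grows until it exceeds realmax
    nexp = nexp + 1;
end
fprintf('overflow after %d steps\n', nexp);
```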




Fig. 2.6. The function $\phi(x) = \cos x$ admits one and only one fixed point (left),
whereas the function $\phi(x) = e^x$ does not have any (right)
Let us clarify the above intuitive idea by considering the following
problem. Given a function $\phi : [a, b] \to \mathbb{R}$, find $\alpha \in [a, b]$ such that

$$\alpha = \phi(\alpha).$$

If such an $\alpha$ exists it will be called a fixed point of $\phi$ and it could be
computed by the following algorithm:

$$x^{(k+1)} = \phi(x^{(k)}), \quad k \ge 0, \qquad (2.9)$$

where $x^{(0)}$ is an initial guess. This algorithm is called fixed point iterations
and $\phi$ is said to be the iteration function. The introductory example
is therefore an instance of fixed point iterations with $\phi(x) = \cos(x)$.

A geometrical interpretation of (2.9) is provided in Figure 2.7 (left).
One can guess that if $\phi$ is a continuous function and the limit of the
sequence $\{x^{(k)}\}$ exists, then such limit is a fixed point of $\phi$. We will
make this result more precise in Propositions 2.1 and 2.2.

Example 2.4 The Newton method (2.4) can be regarded as an algorithm of
fixed point iterations whose iteration function is

$$\phi(x) = x - \frac{f(x)}{f'(x)}. \qquad (2.10)$$

From now on this function will be denoted by $\phi_N$ (where N stands for Newton).
This is not the case for the bisection method since the generic iterate $x^{(k+1)}$
depends not only on $x^{(k)}$ but also on $x^{(k-1)}$.

As shown in Figure 2.7 (right), fixed point iterations may not converge. 
Indeed, the following result holds. 

Proposition 2.1 Assume that the iteration function $\phi$ in (2.9) satisfies
the following properties:

1. $\phi(x) \in [a, b]$ for all $x \in [a, b]$;

2. $\phi$ is differentiable in $[a, b]$;

3. $\exists K < 1$ such that $|\phi'(x)| \le K$ for all $x \in [a, b]$.

Then $\phi$ has a unique fixed point $\alpha \in [a, b]$ and the sequence defined in
(2.9) converges to $\alpha$, whatever choice is made for the initial datum $x^{(0)}$
in $[a, b]$. Moreover

$$\lim_{k \to \infty} \frac{x^{(k+1)} - \alpha}{x^{(k)} - \alpha} = \phi'(\alpha). \qquad (2.11)$$
Fig. 2.7. Representation of a few fixed point iterations for two different iteration
functions. To the left, the iterations converge to the fixed point $\alpha$, whereas
the iterations on the right produce a divergent sequence

From (2.11) one deduces that the fixed point iterations converge at least
linearly, that is, for k sufficiently large the error at the step k + 1 behaves
like that at the step k multiplied by a constant $\phi'(\alpha)$ which is
independent of k and whose absolute value is strictly less than 1.
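The limit (2.11) can be observed numerically. The following sketch, which is ours and only illustrative, uses the iteration function $\phi(x) = \cos(x)$ of the introductory example; the reference value of the fixed point is itself obtained by a large number of iterations (an assumption made to keep the sketch self-contained).

```matlab
% Sketch: ratio of successive errors for phi(x) = cos(x), compared with
% the value phi'(alpha) = -sin(alpha) predicted by (2.11).
phi = @(x) cos(x);
alpha = 1;
for k = 1:200
    alpha = phi(alpha);                   % reference fixed point
end
x = 1;                                    % initial guess
for k = 1:30
    xnew = phi(x);
    ratio = (xnew - alpha)/(x - alpha);   % (x^(k+1) - alpha)/(x^(k) - alpha)
    x = xnew;
end
fprintf('error ratio = %.6f, phi''(alpha) = %.6f\n', ratio, -sin(alpha));
```

The ratio is negative, which reflects the fact that the iterates oscillate around the fixed point.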

Example 2.5 The function $\phi(x) = \cos(x)$ satisfies all the assumptions of
Proposition 2.1. Indeed, $|\phi'(\alpha)| = |\sin(\alpha)| \simeq 0.67 < 1$, and thus by continuity
there exists a neighborhood $I_\alpha$ of $\alpha$ such that $|\phi'(x)| < 1$ for all $x \in I_\alpha$. The
function $\phi(x) = x^2 - 1$, although possessing two fixed points $\alpha_\pm = (1 \pm \sqrt{5})/2$,
does not satisfy the assumption for either since $|\phi'(\alpha_\pm)| = |1 \pm \sqrt{5}| > 1$. The
corresponding fixed point iterations will not converge.

The Newton method is not the only iterative procedure featuring 
quadratic convergence. Indeed, the following general property holds. 



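Proposition 2.2 Assume that all hypotheses of Proposition 2.1 are satisfied.
In addition assume that $\phi$ is twice differentiable and that

$$\phi'(\alpha) = 0, \quad \phi''(\alpha) \ne 0.$$

Then the fixed point iterations (2.9) converge with order 2 and

$$\lim_{k \to \infty} \frac{x^{(k+1)} - \alpha}{(x^{(k)} - \alpha)^2} = \frac{1}{2}\phi''(\alpha). \qquad (2.12)$$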



Example 2.4 shows that the fixed point iterations (2.9) could also be
used to compute the zeros of the function $f$. Clearly for any given $f$ the
function $\phi$ defined in (2.10) is not the only possible iteration function.
For instance, for the solution of the equation $\log(x) = \gamma$, after setting
$f(x) = \log(x) - \gamma$, the choice (2.10) could lead to the iteration function

$$\phi_N(x) = x(1 - \log(x) + \gamma).$$

Another fixed point iteration algorithm could be obtained by adding x
to both sides of the equation $f(x) = 0$. The associated iteration function
is now $\phi_1(x) = x + \log(x) - \gamma$. A further method could be obtained by
choosing the iteration function $\phi_2(x) = x\log(x)/\gamma$. Not all these methods
are convergent. For instance, if $\gamma = -2$, the methods corresponding to
the iteration functions $\phi_N$ and $\phi_2$ are both convergent, whereas the one
with iteration function $\phi_1$ is not since $|\phi_1'(x)| > 1$ in a neighborhood of
the fixed point $\alpha$.
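These statements can be verified with a short sketch for $\gamma = -2$, for which the fixed point is $\alpha = e^{-2}$. The derivatives below were computed by hand from the three iteration functions, and the initial guess 0.1 and the number of iterations are our own assumptions. The function $\phi_1$ is not iterated, since its iterates quickly leave the domain of the logarithm; only its derivative at $\alpha$ is evaluated.

```matlab
% Sketch: the three iteration functions for log(x) = gam with gam = -2.
gam = -2;                                   % the parameter gamma of the text
alpha = exp(gam);                           % exact fixed point
phiN  = @(x) x.*(1 - log(x) + gam);         % Newton iteration function
phi2  = @(x) x.*log(x)/gam;
dphiN = @(x) gam - log(x);                  % derivative of phiN
dphi1 = @(x) 1 + 1./x;                      % derivative of phi1(x) = x + log(x) - gam
dphi2 = @(x) (log(x) + 1)/gam;              % derivative of phi2
slopes = abs([dphiN(alpha), dphi1(alpha), dphi2(alpha)]);
xN = 0.1; x2 = 0.1;                         % initial guesses (an assumption)
for k = 1:40
    xN = phiN(xN);
    x2 = phi2(x2);
end
fprintf('|phi''(alpha)|: %.3g (phiN), %.3g (phi1), %.3g (phi2)\n', slopes);
fprintf('errors after 40 steps: %.2e (phiN), %.2e (phi2)\n', ...
    abs(xN - alpha), abs(x2 - alpha));
```

The derivative of $\phi_N$ vanishes at $\alpha$, consistently with Proposition 2.2, while that of $\phi_2$ equals 1/2 in absolute value and that of $\phi_1$ equals $1 + e^2$.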



2.3.1 How to terminate fixed point iterations 



In general, fixed point iterations are terminated when the absolute value
of the difference between two consecutive iterates is less than a prescribed
tolerance $\varepsilon$.

Since $\alpha = \phi(\alpha)$ and $x^{(k+1)} = \phi(x^{(k)})$, using the mean value theorem
(see Section 1.4.3) we find

$$\alpha - x^{(k+1)} = \phi(\alpha) - \phi(x^{(k)}) = \phi'(\xi^{(k)})(\alpha - x^{(k)}) \quad \text{with } \xi^{(k)} \in I_{\alpha, x^{(k)}},$$

$I_{\alpha, x^{(k)}}$ being the interval with end-points $\alpha$ and $x^{(k)}$. Using the identity

$$\alpha - x^{(k)} = (\alpha - x^{(k+1)}) + (x^{(k+1)} - x^{(k)}),$$

it follows that

$$\alpha - x^{(k)} = \frac{1}{1 - \phi'(\xi^{(k)})}\,(x^{(k+1)} - x^{(k)}). \qquad (2.13)$$

Consequently, if $\phi'(x) \simeq 0$ in a neighborhood of $\alpha$, the difference between
two consecutive iterates provides a satisfactory error estimator. This
is the case for methods of order 2, including Newton's method. This
estimate becomes the more unsatisfactory the closer $\phi'$ gets to 1.
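Relation (2.13) can be illustrated on the cosine iteration. In the sketch below, which is ours, the unknown point $\xi^{(k)}$ is replaced by $\alpha$ (an approximation that is reasonable only when $x^{(k)}$ is already close to $\alpha$), and the reference value of $\alpha$ is obtained by a large number of iterations.

```matlab
% Sketch: true error versus increment for phi(x) = cos(x), see (2.13).
phi = @(x) cos(x);
alpha = 1;
for k = 1:200
    alpha = phi(alpha);          % reference fixed point
end
x = 1;
for k = 1:20
    x = phi(x);                  % x now holds x^(20)
end
incr = phi(x) - x;               % x^(21) - x^(20)
trueerr = alpha - x;             % alpha - x^(20)
est = incr/(1 + sin(alpha));     % (2.13) with phi'(xi) replaced by -sin(alpha)
fprintf('true error %.4e, increment %.4e, corrected estimate %.4e\n', ...
    trueerr, incr, est);
```

Here $\phi'(\alpha) \simeq -0.67$, so the plain increment overestimates the error by a factor of about 1.67, while the corrected estimate is much closer to it.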

Example 2.6 Let us compute with Newton's method the zero $\alpha = 1$ of the
function $f(x) = (x - 1)^{m-1}\log(x)$ for m = 11 and m = 21, whose multiplicity
is equal to m. In this case Newton's method converges with order 1; moreover,
it is possible to prove (see Exercise 2.15) that $\phi_N'(\alpha) = 1 - 1/m$, $\phi_N$ being the
iteration function of the method, regarded as a fixed point iteration algorithm.
As m increases, the accuracy of the error estimate furnished by the difference
between two consecutive iterates decreases. This is confirmed by the numerical
results in Figure 2.8 where we compare the behavior of the true error with that
of our estimator for both m = 11 and m = 21. The difference between these
two quantities increases as m increases.




Fig. 2.8. Absolute values of the errors (solid line) and absolute values of the 
difference between two consecutive iterates (dashed line), plotted versus the 
number of iterations for the case of Example 2.6. Graphs (1) refer to m = 11, 
graphs (2) to m = 21 



Let us summarize 



1. A number $\alpha$ satisfying $\phi(\alpha) = \alpha$ is called a fixed point of $\phi$. For
its computation we can use the so-called fixed point iterations:
$x^{(k+1)} = \phi(x^{(k)})$;

2. fixed point iterations converge under suitable assumptions on $\phi$ and
its first derivative. Typically, convergence is linear, however, in the
special case when $\phi'(\alpha) = 0$, the fixed point iterations converge
quadratically;

3. fixed point iterations can be used also to compute the zeros of a
function.



See Exercises 2.15-2.18. 



2.4 What we haven't told you 



The most sophisticated methods for the computation of the zeros of a
function combine different algorithms. In particular, the MATLAB function
fzero (see Section 1.4.1) adopts the so called Dekker-Brent method
(see [QSS00], Section 6.2.3). In its basic form fzero(fun,x0) computes
the zero of the function fun, where fun can be either a string which is a
function of x, or the name of an inline function or the name of an m-file.

For instance, we could solve the problem of Example 2.1 also by fzero,
using the initial value x0=0.3 (as done by Newton's method) through
the following instructions:

>> f=inline('6000 - 1000*(1+I)/I*((1+I)^5 - 1)','I'); x0=0.3;
>> [alpha,res,flag,iter]=fzero(f,x0);

We obtain alpha=0.06140241153653 with residual res=9.0949e-13 in
iter=29 iterations. When flag is negative it means that fzero cannot
find the zero. The Newton method converges in 6 iterations to the value
0.06140241153652 with a residual equal to 2.3646e-11.

In this chapter, we have frequently mentioned the problem of how to
compute the zeros of a polynomial. The solution of this problem requires
ad hoc algorithms which are typically based on a deflation technique, a
procedure which eliminates automatically the zeros already computed.
Among others let us mention the Sturm method, the Muller method,
the Newton-Hörner method (see [Atk89] or [QSS00]) and the Bairstow
method ([RR85]). A different approach consists of characterizing the zeros
of a function as the eigenvalues of a special matrix (called the companion
matrix) and then using appropriate techniques for their computation.
This approach is adopted by the MATLAB function roots which
has been introduced in Section 1.4.2.
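The companion matrix approach can be illustrated with the MATLAB commands compan and eig. The sketch below is only meant to show the idea on a cubic polynomial of our own choice, with known zeros 1, 2 and 3; it is not a description of how roots is implemented internally.

```matlab
% Sketch: zeros of a polynomial as eigenvalues of its companion matrix.
p = [1 -6 11 -6];      % x^3 - 6x^2 + 11x - 6 = (x-1)(x-2)(x-3)
C = compan(p);         % companion matrix associated with p
z = sort(eig(C));      % its eigenvalues are the zeros of p
disp(z.');
```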

Newton's method and fixed point iterations can be easily extended
to compute the roots of nonlinear systems of the following form (see
[QSS00], Chapter 7):

$$\begin{cases} f_1(x_1, x_2, \ldots, x_n) = 0, \\ f_2(x_1, x_2, \ldots, x_n) = 0, \\ \quad \vdots \\ f_n(x_1, x_2, \ldots, x_n) = 0, \end{cases}$$

where $f_1, \ldots, f_n$ are nonlinear functions. Other methods exist as well,
such as the Broyden and quasi-Newton methods, and can be regarded
as generalizations of Newton's method (see [DS83]).

The MATLAB instruction

zero = fsolve('fun', x0)

allows the computation of one zero of a nonlinear system defined through
the user function fun starting from the vector x0 as initial guess. The
function fun returns the n values $f_1(\mathbf{x}), \ldots, f_n(\mathbf{x})$ for any value of the
input vector $\mathbf{x}$.

For instance, let us consider the following system:

$$\begin{cases} x^2 + y^2 = 1, \\ \sin(\pi x/2) + y^3 = 0, \end{cases} \qquad (2.14)$$

whose solutions are (0.4761, -0.8794) and (-0.4761, 0.8794).

The corresponding MATLAB user function, which we call systemnl,
is defined as follows:

function fx=systemnl(x)
fx(1) = x(1)^2+x(2)^2-1;
fx(2) = sin(pi*0.5*x(1))+x(2)^3;

The MATLAB instructions to solve this system are therefore:

>> x0 = [1 1];
>> alpha=fsolve('systemnl',x0)
alpha =
    0.4761   -0.8794

Using this procedure we have found only one of the two roots. The other
can be computed starting from the initial datum -x0.
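Since fsolve belongs to the Optimization Toolbox, it may be useful to see how the same root can be approximated by a hand-written Newton iteration for systems. The sketch below is ours and is not what fsolve does internally: the Jacobian matrix J was computed by hand from (2.14), and the initial guess (0.5, -0.9) is an assumption chosen close to the root found above.

```matlab
% Sketch: Newton's method for the system (2.14).
F = @(x) [x(1)^2 + x(2)^2 - 1; ...
          sin(pi*0.5*x(1)) + x(2)^3];            % the two equations
J = @(x) [2*x(1),                    2*x(2); ...
          0.5*pi*cos(pi*0.5*x(1)),   3*x(2)^2];  % Jacobian matrix of F
x = [0.5; -0.9];                 % initial guess (an assumption)
tol = 1e-12; nmax = 50;          % illustrative stopping parameters
niter = 0; incr = tol + 1;
while incr >= tol && niter < nmax
    niter = niter + 1;
    dx = -J(x)\F(x);             % solve the linear system J(x) dx = -F(x)
    x = x + dx;
    incr = norm(dx);             % size of the Newton increment
end
fprintf('x = (%.4f, %.4f) after %d iterations\n', x(1), x(2), niter);
```

Starting from the opposite initial guess, the iteration is expected to approach the other root, by the symmetry of the system.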



2.5 Exercises 



Exercise 2.1 Given the function $f(x) = \cosh x + \cos x - \gamma$, for $\gamma = 1, 2, 3$ find
an interval that contains the zero of $f$. Then compute the zero by the bisection
method with a tolerance of $10^{-10}$.

Exercise 2.2 For carbon dioxide (CO2) the coefficients a and b in (2.1) take
the following values: $a = 0.401$ Pa m$^6$, $b = 42.7 \cdot 10^{-6}$ m$^3$ (Pa stands for
Pascal). Find the volume occupied by 1000 molecules of CO2 at a temperature
T = 300 K and a pressure $p = 3.5 \cdot 10^7$ Pa by the bisection method, with a
tolerance of $10^{-12}$ (the Boltzmann constant is $k = 1.3806503 \cdot 10^{-23}$ Joule K$^{-1}$).

Exercise 2.3 An object is standing on a plane whose slope varies with constant
velocity $\omega$. After t seconds its position is

$$s(t, \omega) = \frac{g}{2\omega^2}\left[\sinh(\omega t) - \sin(\omega t)\right],$$

where g = 9.8 m/s$^2$ denotes the gravity acceleration. Assuming that this object
has moved by 1 meter in 1 second, compute the corresponding value of $\omega$ with
a tolerance of $10^{-5}$.
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Exercise 2.4 Prove inequality (2.3). 

Exercise 2.5 Explain why in Program 1 the instruction x(2) = x(1)+(x(3)-x(1))*0.5 
has been used instead of the more natural one x(2) = (x(1)+x(3))*0.5 
in order to compute the midpoint. 

Exercise 2.6 Apply Newton’s method to solve Exercise 2.1. Why is this 
method not accurate when $\gamma = 2$? 

Exercise 2.7 Apply Newton’s method to compute the square root of a 
positive number $a$. Proceed in a similar manner to compute the cube root of 
$a$. 

Exercise 2.8 Assuming that Newton’s method converges, show that (2.6) 
is true when $\alpha$ is a simple root of $f(x) = 0$ and $f$ is twice continuously 
differentiable in a neighborhood of $\alpha$. 

Exercise 2.9 Apply Newton’s method to solve Problem 2.3 for $\beta \in [0, 2\pi/3]$ 
with a tolerance of $10^{-5}$. Assume that the lengths of the rods are $a_1 = 10$ cm, 
$a_2 = 13$ cm, $a_3 = 8$ cm and $a_4 = 10$ cm. For each value of $\beta$ consider two 
possible initial data, $x^{(0)} = -0.1$ and $x^{(0)} = 2\pi/3$. 

Exercise 2.10 Notice that the function $f(x) = e^x - 2x^2$ has 3 zeros, $\alpha_1 < 0$, 
$\alpha_2$ and $\alpha_3$ positive. For which value of $x^{(0)}$ does Newton’s method converge 
to $\alpha_1$? 

Exercise 2.11 Use Newton’s method to compute the zero of 
$f(x) = x^3 - 3x^2 2^{-x} + 3x 4^{-x} - 8^{-x}$ in [0, 1] and explain why convergence is not quadratic. 

Exercise 2.12 A projectile is ejected with velocity $v_0$ and angle $\alpha$ in a tunnel 
of height $h$ and reaches its maximum range when $\alpha$ is such that 
$\sin(\alpha) = \sqrt{2gh/v_0^2}$, where $g = 9.8$ m/s$^2$ is the gravity acceleration. Compute $\alpha$ using 
Newton’s method, assuming that $v_0 = 10$ m/s and $h = 1$ m. 

Exercise 2.13 Solve Problem 2.1 by Newton’s method with a tolerance of 
$10^{-12}$, assuming $M = 6000$ euros, $v = 1000$ euros and $n = 5$. As an initial 
guess take the result obtained after 5 iterations of the bisection method applied 
on the interval (0.01, 0.1). 

Exercise 2.14 A corridor has the form indicated in Figure 2.9. The maximum 
length $L$ of a rod that can pass from one extreme to the other sliding on the 
ground is given by 

$$ L = l_2/\sin(\pi - \gamma - \alpha) + l_1/\sin(\alpha), $$

where $\alpha$ is the solution of the nonlinear equation 

$$ l_2 \frac{\cos(\pi - \gamma - \alpha)}{\sin^2(\pi - \gamma - \alpha)} - l_1 \frac{\cos(\alpha)}{\sin^2(\alpha)} = 0. \tag{2.15} $$

Compute $\alpha$ by Newton’s method when $l_2 = 10$, $l_1 = 8$ and $\gamma = 3\pi/5$. 




Fig. 2.9. The problem of a rod sliding in a corridor 



Exercise 2.15 Let $\phi_N$ be the iteration function of Newton’s method when 
regarded as a fixed point iteration. Show that $\phi_N'(\alpha) = 1 - 1/m$ where $\alpha$ 
is a zero of $f$ with multiplicity $m$. Deduce that Newton’s method converges 
quadratically if $\alpha$ is a simple root of $f(x) = 0$, and linearly otherwise. 

Exercise 2.16 Deduce from the graph of $f(x) = x^3 + 4x^2 - 10$ that this 
function has a unique real zero $\alpha$. To compute $\alpha$ use the following fixed point 
iterations: given $x^{(0)}$, define $x^{(k+1)}$ such that 

$$ x^{(k+1)} = \frac{2(x^{(k)})^3 + 4(x^{(k)})^2 + 10}{3(x^{(k)})^2 + 8x^{(k)}}, \quad k \ge 0, $$

and analyze its convergence to $\alpha$. 

Exercise 2.17 Analyze the convergence of the fixed point iterations 

$$ x^{(k+1)} = \frac{x^{(k)}\left[(x^{(k)})^2 + 3a\right]}{3(x^{(k)})^2 + a}, \quad k \ge 0, $$

for the computation of the square root of a positive number $a$. 



Exercise 2.18 Repeat the computations carried out in Exercise 2.11 using 
now the stopping criterion based on the residual (see Remark 2.1). Which is 
the more accurate result? 




3. Approximation of functions 
and data 



Approximating a function $f$ consists of replacing it by another function 
$\tilde f$ of simpler form that may be used as its surrogate. This strategy is 
used frequently in numerical integration where, instead of computing 
$\int_a^b f(x)\,dx$, one carries out the exact computation of $\int_a^b \tilde f(x)\,dx$, $\tilde f$ being 
a function simple to integrate (e.g. a polynomial), as we will see in the 
next chapter. In other instances the function $f$ may be available only 
partially through its values at some selected points. In these cases we 
aim at constructing a continuous function $\tilde f$ that could represent the 
empirical law which is behind the finite set of data. We provide a couple 
of examples which illustrate this kind of approach. 



Problem 3.1 (Climatology) The air temperature near the ground depends 
on the concentration K of the carbon acid therein. In Table 3.1 (taken from 
Philosophical Magazine 41, 237 (1896)) we report for 3 different values of K 
the variation of the average temperature with respect to the actual average 
temperature (normalized at the reference value K = 1) at different latitudes 
on the Earth. In this case we can generate a function that, on the basis of the 
available data, provides an approximate value of the average temperature at 
any possible latitude and for other values of K (see Example 3.1). • 



Problem 3.2 (Finance) In Figure 3.1 we report the price of a stock at the 
Zurich stock exchange over two years. The curve was obtained by joining 
with a straight line the prices reported at every day’s closure. This simple 
representation indeed implicitly assumes that the prices change linearly in the 
course of the day (we anticipate that this approximation is called composite 
linear interpolation). We ask whether from this graph one could predict the 
stock price for a short time interval beyond the time of the last quotation. 
We will see in Section 3.4 that this kind of prediction could be guessed by 
resorting to a special technique known as least squares approximation of data 
(see Example 3.8). • 

Latitude   K = 0.67   K = 1.5   K = 2.0   K = 3.0
   65       -3.1       3.52      6.05      9.3
   55       -3.22      3.62      6.02      9.3
   45       -3.3       3.65      5.92      9.17
   35       -3.32      3.52      5.7       8.82
   25       -3.17      3.47      5.3       8.1
   15       -3.07      3.25      5.02      7.52
    5       -3.02      3.15      4.95      7.3
   -5       -3.02      3.15      4.97      7.35
  -15       -3.12      3.2       5.07      7.62
  -25       -3.2       3.27      5.35      8.22
  -35       -3.35      3.52      5.62      8.8
  -45       -3.37      3.7       5.95      9.25
  -55       -3.25      3.7       6.1       9.5

Tab. 3.1. Variation of the average yearly temperature on the Earth for 
different values of the concentration K of carbon acid at different latitudes 




Fig. 3.1. Price variation of a stock over two years 

It is known that a function $f$ can be successfully replaced in a given 
interval by its Taylor polynomial, which was introduced in Section 1.4.3. 
This technique is computationally expensive since it requires the knowledge 
of $f$ and its derivatives up to the order $n$ (the polynomial degree) at 
a given point $x_0$. A further drawback is represented by the fact that the 
Taylor polynomial may fail to accurately represent $f$ far enough from 
the point $x_0$. For instance, in Figure 3.2 we compare the behavior of 
$f(x) = 1/x$ with that of its Taylor polynomial of degree 10 built around 
the point $x_0 = 1$. This picture also shows the graphical interface of the 
MATLAB function taylortool which allows the computation of Taylor’s 
polynomial of arbitrary degree for any given function $f$. The agreement 
between the function and its Taylor polynomial is very good in a small 
neighborhood of $x_0 = 1$ while it becomes unsatisfactory when $x - x_0$ 
gets large. Fortunately, this is not the case of other functions such as the 
exponential function which is approximated quite nicely for all $x \in \mathbb{R}$ by 
its Taylor polynomial related to $x_0 = 0$, provided that the degree $n$ is 
sufficiently large. 
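
The loss of accuracy away from the expansion point can be checked with a few lines. This is a minimal sketch, assuming the closed form of the Taylor coefficients of 1/x around 1, which are (-1)^k; the handle name taylor1x and the two sample points are illustrative choices.

```matlab
% Taylor polynomial of degree n of f(x) = 1/x around x0 = 1:
% T_n(x) = sum_{k=0}^{n} (-1)^k (x-1)^k
n = 10; k = 0:n;
taylor1x = @(x) sum((-1).^k .* (x - 1).^k);   % valid for scalar x only
xnear = 1.2; xfar = 2.5;                      % assumed sample points
err_near = abs(1/xnear - taylor1x(xnear));    % error close to x0
err_far  = abs(1/xfar  - taylor1x(xfar));     % error far from x0
```

The first error is tiny while the second is larger than the function value itself, in agreement with the behavior described for Figure 3.2.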





Fig. 3.2. Comparison between the function $f(x) = 1/x$ (solid line) and its 
Taylor polynomial of degree 10 related to the point $x_0 = 1$ (dashed line). The 
explicit form of the Taylor polynomial is also reported 



In the course of this chapter we will introduce approximation methods 
that are based on alternative approaches. 



3.1 Interpolation 



As seen in Problems 3.1 and 3.2, in several applications it may happen 
that a function is known only through its values at some given points. 
We are therefore facing a (general) case where $n + 1$ couples $\{x_i, f(x_i)\}$, 
$i = 0, \ldots, n$, are given; the points $x_i$ are all distinct and are called nodes. 

For instance in the case of Table 3.1, $n$ is equal to 12, the nodes $x_i$ are 
the values of the latitude reported in the first column, while the $f(x_i)$ are 
the corresponding values (of the temperature) in the remaining columns. 



In such a situation it seems natural to require the approximate function 
$\tilde f$ to satisfy the set of relations 

$$ \tilde f(x_i) = f(x_i), \quad i = 0, 1, \ldots, n. \tag{3.1} $$

Such an $\tilde f$ is called the interpolant of $f$ and equations (3.1) are the 
interpolation conditions. 

Several kinds of interpolants could be envisaged, such as: 

- polynomial interpolant: 

$$ \tilde f(x) = a_0 + a_1 x + a_2 x^2 + \ldots + a_n x^n; $$

- trigonometric interpolant: 

$$ \tilde f(x) = a_{-M} e^{-iMx} + \ldots + a_0 + \ldots + a_M e^{iMx}, $$

  where $M$ is an integer equal to $n/2$ if $n$ is even, $(n - 1)/2$ if $n$ is 
  odd, and $i$ is the imaginary unit; 

- rational interpolant: 

$$ \tilde f(x) = \frac{a_0 + a_1 x + \ldots + a_k x^k}{a_{k+1} + a_{k+2} x + \ldots + a_{k+n+1} x^n}. $$

For simplicity we only consider those interpolants which depend linearly 
on the $n + 1$ unknown coefficients $a_i$. Both polynomial and trigonometric 
interpolation fall into this category, whereas the rational interpolant 
does not. 



3.1.1 Lagrangian polynomial interpolation 



Let us focus on the polynomial interpolation. The following result holds: 

Proposition 3.1 For any set of couples $\{x_i, f(x_i)\}$, $i = 0, \ldots, n$, with 
distinct nodes $x_i$, there exists a unique polynomial of degree less than or 
equal to $n$, which we indicate by $\Pi_n f$ and call interpolating polynomial 
of the values $f(x_i)$ at the nodes $x_i$, such that 

$$ \Pi_n f(x_i) = f(x_i), \quad i = 0, \ldots, n. \tag{3.2} $$

In the case where the $f(x_i)$, $i = 0, \ldots, n$, represent the values attained 
by a continuous function $f$, $\Pi_n f$ is called interpolating polynomial of $f$ 
(in short, interpolant of $f$). 

To verify uniqueness we proceed by contradiction and suppose that 
there exist two distinct polynomials of degree at most $n$, $\Pi_n f$ and $\Pi_n^* f$, both 
satisfying the nodal relation (3.2). Their difference, $\Pi_n f - \Pi_n^* f$, would be a 
polynomial of degree at most $n$ which vanishes at $n + 1$ distinct points. Owing to a well 
known theorem of Algebra, such a polynomial should vanish identically, 
and then $\Pi_n^* f$ must coincide with $\Pi_n f$. 

In order to obtain an expression for $\Pi_n f$, we start from a very special 
case where $f(x_i)$ vanishes for all $i$ apart from $i = k$ (for a fixed $k$) for 
which $f(x_k) = 1$. Then setting $\varphi_k(x) = \Pi_n f(x)$, we must have (see 
Figure 3.3) 

$$ \varphi_k \in \mathbb{P}_n, \qquad \varphi_k(x_j) = \delta_{jk} = \begin{cases} 1 & \text{if } j = k, \\ 0 & \text{otherwise} \end{cases} $$

($\delta_{jk}$ is the Kronecker symbol). 




Fig. 3.3. The polynomial $\varphi_2 \in \mathbb{P}_4$ associated with a set of 5 equispaced nodes 

The functions $\varphi_k$ can be written as follows: 

$$ \varphi_k(x) = \prod_{\substack{j=0 \\ j \ne k}}^{n} \frac{x - x_j}{x_k - x_j}, \quad k = 0, \ldots, n. \tag{3.3} $$

We move now to the general case where $\{f(x_i), i = 0, \ldots, n\}$ is a set of 
arbitrary values. Using an obvious superposition principle we can obtain 
the following expression for $\Pi_n f$: 

$$ \Pi_n f(x) = \sum_{k=0}^{n} f(x_k) \varphi_k(x). \tag{3.4} $$

Indeed, this polynomial satisfies the interpolation conditions (3.2), since 

$$ \Pi_n f(x_i) = \sum_{k=0}^{n} f(x_k) \varphi_k(x_i) = \sum_{k=0}^{n} f(x_k) \delta_{ik} = f(x_i), \quad i = 0, \ldots, n. $$

Due to their special role, the functions $\varphi_k$ are called Lagrange characteristic 
polynomials, and (3.4) is the Lagrange form of the interpolant. 
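
Formulas (3.3) and (3.4) translate almost literally into a double loop. The following is a minimal sketch, assuming five sample data pairs (the same ones that appear in Example 3.1) and an evaluation grid z; the variable names phik and p are illustrative.

```matlab
% Evaluate the Lagrange form (3.4) of the interpolant at the points z
x = [-55 -25 5 35 65];              % nodes
y = [-3.25 -3.2 -3.02 -3.32 -3.1];  % values to be interpolated
z = linspace(x(1), x(end), 100);    % evaluation points (assumed)
n = length(x) - 1;
p = zeros(size(z));
for k = 1:n+1
    phik = ones(size(z));           % characteristic polynomial phi_k at z, see (3.3)
    for j = [1:k-1, k+1:n+1]
        phik = phik .* (z - x(j)) / (x(k) - x(j));
    end
    p = p + y(k) * phik;            % superposition (3.4)
end
```

This evaluates the interpolant without computing its coefficients; the monomial coefficients, when needed, are obtained with polyfit.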

In MATLAB we can store the $n+1$ couples $\{(x_i, f(x_i))\}$ in the vectors 
x and y, and then the instruction c=polyfit(x,y,n) will provide the 
coefficients of the interpolating polynomial. Precisely, c(1) will contain 
the coefficient of $x^n$, c(2) that of $x^{n-1}$, ... and c(n+1) the value of 
$\Pi_n f(0)$. (More on this command can be found in Section 3.4.) As already 
seen in Chapter 1, we can then use the instruction p=polyval(c,z) to 
compute the value p(j) attained by the interpolating polynomial at 
z(j), j=1,...,m, the latter being a set of m arbitrary points. 

In the case when the explicit form of the function f is available, we can 
use the instruction y=eval(f) in order to obtain the vector y of values 
of f at some specific nodes (which should be stored in a vector x). 

Example 3.1 To obtain the interpolating polynomial for the data of Problem 
3.1 relating to the value K = 0.67 (first column of Table 3.1), using only the 
values of the temperature for the latitudes 65, 35, 5, -25, -55, we can use the 
following MATLAB instructions: 

>> x = [-55 -25 5 35 65]; y = [-3.25 -3.2 -3.02 -3.32 -3.1]; 
>> format short e; c=polyfit(x,y,4) 
c = 
   8.2819e-08 -4.5267e-07 -3.4684e-04 3.7757e-04 -3.0132e+00 

The graph of the interpolating polynomial can be obtained as follows: 

>> z=linspace(x(1),x(end),100); 
>> p=polyval(c,z); plot(z,p,x,y,'o'); 

In order to get a smooth curve we have evaluated our polynomial at 100 
equispaced points in the interval [-55, 65] (as a matter of fact, MATLAB plots 
are always constructed on piecewise linear interpolation between neighboring 
points). Note that the instruction x(end) picks up directly the last component 
of the vector x, without specifying the length of the vector. In Figure 3.4 the 
filled circles correspond to those values which have been used to construct the 
interpolating polynomial, whereas the empty circles correspond to values that 
have not been used. We can appreciate the qualitative agreement between the 
curve and the data distribution. 




Fig. 3.4. The interpolating polynomial of degree 4 introduced in Example 3.1 

Using the following result we can evaluate the error obtained by replacing 
$f$ with its interpolating polynomial $\Pi_n f$: 

Proposition 3.2 Let $I$ be a bounded interval, and consider $n + 1$ distinct 
interpolation nodes $\{x_i, i = 0, \ldots, n\}$ in $I$. Let $f$ be continuously 
differentiable up to order $n + 1$ in $I$. Then $\forall x \in I$ $\exists \xi \in I$ such that 

$$ E_n f(x) = f(x) - \Pi_n f(x) = \frac{f^{(n+1)}(\xi)}{(n+1)!} \prod_{i=0}^{n} (x - x_i). \tag{3.5} $$

Obviously, $E_n f(x_i) = 0$, $i = 0, \ldots, n$. 

Result (3.5) can be better specified in the case of a uniform distribution 
of nodes, that is when $x_i = x_{i-1} + h$ for $i = 1, \ldots, n$, for a given $h > 0$ 
and a given $x_0$. As stated in Exercise 3.1, $\forall x \in (x_0, x_n)$ one can verify 
that 

$$ \left| \prod_{i=0}^{n} (x - x_i) \right| \le n! \, \frac{h^{n+1}}{4}, \tag{3.6} $$

and therefore 

$$ \max_{x \in I} |E_n f(x)| \le \frac{\max_{x \in I} |f^{(n+1)}(x)|}{4(n+1)} \, h^{n+1}. \tag{3.7} $$

Unfortunately, we cannot deduce from (3.7) that the error tends to 0 
when $n \to \infty$, in spite of the fact that $h^{n+1}/[4(n + 1)]$ tends to 0. In fact, 
as shown in Example 3.2, there exist functions $f$ for which the limit can 
even be infinite, that is 

$$ \lim_{n \to \infty} \max_{x \in I} |E_n f(x)| = \infty. $$

This striking result indicates that by increasing the degree $n$ of the 
interpolating polynomial we do not necessarily obtain a better 
reconstruction of $f$. For instance, should we use all data of the first column 
of Table 3.1, we would obtain the interpolating polynomial $\Pi_{12} f$ 
represented in Figure 3.5, whose behavior in the vicinity of the left-hand end of 
the interval is far less satisfactory than that obtained in Figure 3.4 using 
a much smaller number of nodes. An even worse result may arise for a 
special class of functions, as we report in the next Example. 

Example 3.2 (Runge) If the function $f(x) = 1/(1 + x^2)$ is interpolated 
at equispaced nodes in the interval $I = (-5, 5)$, the error $\max_{x \in I} |E_n f(x)|$ 
tends to infinity when $n \to \infty$. This is due to the fact that if $n \to \infty$ the 
order of magnitude of $\max_{x \in I} |f^{(n+1)}(x)|$ outweighs the infinitesimal order of 
$h^{n+1}/[4(n + 1)]$. This conclusion can be verified by computing the maximum of 
$f$ and its derivatives up to the order 21 by means of the following instructions: 

>> syms x; n=20; f='1/(1+x^2)'; df=diff(f,1); 
>> cdf = char(df); 
>> for i = 1:n+1, df = diff(df,1); cdfn = char(df); 
x = fzero(cdfn,0); M(i) = abs(eval(cdf)); cdf = cdfn; 
end 

The maximum absolute values of the functions $f^{(n)}$, $n = 1, \ldots, 21$, are stored in the 
vector M. Notice that the command char converts the symbolic expression df 
into a string that can be evaluated by the function fzero. In particular, the 
maximum absolute values of $f^{(n)}$ for $n = 3, 9, 15, 21$ are: 

>> M([3,9,15,21]) 
ans = 
   4.6686e+00 3.2426e+05 1.2160e+12 4.8421e+19 

while the corresponding values of the maximum of $\prod_{i=0}^{n} (x - x_i)/(n + 1)!$ are 

>> z = linspace(-5,5,10000); 
>> for n=0:20; h=10/(n+1); x=[-5:h:5]; 
c=poly(x); r(n+1)=max(polyval(c,z)); r(n+1)=r(n+1)/prod([1:n+2]); end 
>> r([3,9,15,21]) 
ans = 
   2.8935e+00 5.1813e-03 8.5854e-07 2.1461e-11 

where c=poly(x) is a vector whose elements are the coefficients of the 
polynomial whose roots are the elements of the vector x. It follows that $\max_{x \in I} |E_n f(x)|$ 
attains the following values: 

>> format short e; 
   1.3509e+01 1.6801e+03 1.0442e+06 1.0399e+09 

for $n = 3, 9, 15, 21$, respectively. 

The lack of convergence is also indicated by the presence of severe 
oscillations in the graph of the interpolating polynomial with respect to the graph 
of $f$, especially near the endpoints of the interval (see Figure 3.5, right). This 
behavior is known as Runge’s phenomenon. 
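
The growth of the error can also be observed directly, by measuring the distance between Runge's function and its interpolant on a fine grid. This is a minimal sketch, assuming that the interpolant is evaluated through the Lagrange form (3.4) rather than through polyfit, and that the degrees 5, 10 and 20 and the grid z are illustrative choices.

```matlab
% Maximum interpolation error for Runge's function on equispaced nodes
f = @(x) 1 ./ (1 + x.^2);
z = linspace(-5, 5, 1001);        % evaluation grid (assumed)
degrees = [5 10 20];              % polynomial degrees to compare (assumed)
err = zeros(size(degrees));
for m = 1:length(degrees)
    n = degrees(m);
    x = linspace(-5, 5, n+1);     % n+1 equispaced nodes
    y = f(x);
    p = zeros(size(z));
    for k = 1:n+1                 % Lagrange form (3.4)
        phik = ones(size(z));
        for j = [1:k-1, k+1:n+1]
            phik = phik .* (z - x(j)) / (x(k) - x(j));
        end
        p = p + y(k) * phik;
    end
    err(m) = max(abs(f(z) - p));  % error measured on the grid z
end
```

The entries of err increase with the degree, which is the numerical signature of Runge's phenomenon.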




Fig. 3.5. Two examples of Runge’s phenomenon: to the left, $\Pi_{12} f$ computed 
for the data of Table 3.1, column K = 0.67; to the right, $\Pi_{12} f$ (solid line) 
computed on 13 equispaced nodes for the function $f(x) = 1/(1 + x^2)$ (dashed 
line) 



Remark 3.1 The following inequality can also be proved: 

$$ \max_{x \in I} |f'(x) - (\Pi_n f)'(x)| \le C h^n \max_{x \in I} |f^{(n+1)}(x)|, $$

where $C$ is a constant independent of $h$. Therefore, if we approximate the first 
derivative of $f$ by the first derivative of $\Pi_n f$, we lose an order of convergence 
with respect to $h$. In MATLAB, $(\Pi_n f)'$ can be computed using the instruction 
[d]=polyder(c), where c is the input vector in which we store the coefficients 
of the interpolating polynomial, while d is the output vector where we store 
the coefficients of its first derivative (see Section 1.4.2). 
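
As a small illustration of polyder, the sketch below differentiates a polynomial given by its coefficients; the quadratic used here is a hypothetical example and not one of the interpolants of this section.

```matlab
% Derivative of a polynomial given by its coefficients
c = [1 -3 2];        % p(x) = x^2 - 3x + 2 (hypothetical example)
d = polyder(c);      % coefficients of p'(x) = 2x - 3
dp = polyval(d, 1);  % value of p'(1)
```

The same two calls apply unchanged to the vector c returned by polyfit.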




See Exercises 3.1-3.4. 

Fig. 3.6. The left side picture shows the comparison between the function 
$f(x) = 1/(1 + x^2)$ (thin solid line) and its Chebyshev interpolating polynomials 
of degree 8 (dashed line) and 12 (solid line). Note that the amplitude of 
spurious oscillations decreases as the degree increases. The right side picture 
shows the distribution of Chebyshev nodes in the interval [-1, 1] 



3.1.2 Chebyshev interpolation 



Runge’s phenomenon can be avoided if a suitable distribution of nodes 
is used. In particular, in an arbitrary interval $[a, b]$, we can consider the 
so called Chebyshev nodes (see Figure 3.6, right): 

$$ x_i = \frac{a+b}{2} + \frac{b-a}{2} \hat{x}_i, \quad \text{where } \hat{x}_i = -\cos(\pi i/n), \quad i = 0, \ldots, n. $$

Indeed, for this special distribution of nodes it is possible to prove that, 
if $f$ is a continuous and differentiable function in $[a, b]$, $\Pi_n f$ converges 
to $f$ as $n \to \infty$ for all $x \in [a, b]$. 

The Chebyshev nodes, which are the abscissas of equispaced nodes on 
the unit semi-circle, lie inside $[a, b]$ and are clustered near the endpoints 
of this interval (see Figure 3.6). 

Another non-uniform distribution of nodes in the interval $[a, b]$, sharing 
the same convergence properties of Chebyshev nodes, is provided by: 

$$ x_i = \frac{a+b}{2} - \frac{b-a}{2} \cos\left( \frac{2i+1}{n+1} \, \frac{\pi}{2} \right), \quad i = 0, \ldots, n. $$
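
Both node families are one-liners in MATLAB. The following is a minimal sketch, assuming the interval [-5, 5] and n = 10; the names idx, x1 and x2 are illustrative.

```matlab
% Two families of Chebyshev-type nodes on [a,b]
a = -5; b = 5; n = 10;                    % assumed values
idx = 0:n;
% first family: the endpoints a and b are nodes
x1 = (a+b)*0.5 + (b-a)*0.5 * (-cos(pi*idx/n));
% second family: all nodes lie strictly inside (a,b)
x2 = (a+b)*0.5 - (b-a)*0.5 * cos((2*idx+1)/(n+1) * pi/2);
```

In both cases the nodes come out in increasing order and cluster toward the endpoints of the interval.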



Example 3.3 We consider anew the function $f$ of Runge’s example and compute 
its interpolating polynomial at Chebyshev nodes. The latter can be 
obtained through the following MATLAB instructions: 

>> xc = -cos(pi*[0:n]/n); x = (a+b)*0.5+(b-a)*xc*0.5; 

where n+1 is the number of nodes, while a and b are the endpoints of the 
interpolation interval (in the sequel we choose a=-5 and b=5). Then we compute 
the interpolating polynomial by the following instructions: 

>> f='1./(1+x.^2)'; y = eval(f); c = polyfit(x,y,n); 

Now let us compute the absolute values of the differences between $f$ and its 
Chebyshev interpolant at as many as 1000 equispaced points in the interval 
[-5, 5] and take the maximum error values: 

>> x = linspace(-5,5,1000); p=polyval(c,x); fx = eval(f); err = max(abs(p-fx)); 

As we see in Table 3.2, the maximum of the error decreases when n increases. 

n      5        10       20       40
E_n    0.6386   0.1322   0.0177   0.0003

Tab. 3.2. The Chebyshev interpolation error for Runge’s function 
$f(x) = 1/(1 + x^2)$ 



3.1.3 Trigonometric interpolation and FFT 



We want to approximate a periodic function $f : [0, 2\pi] \to \mathbb{R}$, i.e. 
one satisfying $f(0) = f(2\pi)$, by a trigonometric polynomial $\tilde f$ which 
interpolates $f$ at the $n + 1$ nodes $x_j = 2\pi j/(n + 1)$, $j = 0, \ldots, n$, i.e. 

$$ \tilde f(x_j) = f(x_j), \quad \text{for } j = 0, \ldots, n. \tag{3.8} $$

The trigonometric interpolant $\tilde f$ is obtained by a linear combination of 
sines and cosines. 

In particular, if $n$ is even, $\tilde f$ will have the form 

$$ \tilde f(x) = \frac{a_0}{2} + \sum_{k=1}^{M} \left[ a_k \cos(kx) + b_k \sin(kx) \right], \tag{3.9} $$

where $M = n/2$ while, if $n$ is odd, 

$$ \tilde f(x) = \frac{a_0}{2} + \sum_{k=1}^{M} \left[ a_k \cos(kx) + b_k \sin(kx) \right] + a_{M+1} \cos((M+1)x), \tag{3.10} $$

where $M = (n - 1)/2$. We can rewrite (3.9) as 

$$ \tilde f(x) = \sum_{k=-M}^{M} c_k e^{ikx}, \tag{3.11} $$

$i$ being the imaginary unit. The coefficients $c_k$ are related to the 
coefficients $a_k$ and $b_k$ as follows: 

$$ a_k = c_k + c_{-k}, \quad b_k = i(c_k - c_{-k}), \quad k = 0, \ldots, M. \tag{3.12} $$

Indeed, from (1.5) it follows that $e^{ikx} = \cos(kx) + i \sin(kx)$ and 

$$ \sum_{k=-M}^{M} c_k e^{ikx} = \sum_{k=-M}^{M} c_k \left( \cos(kx) + i \sin(kx) \right) 
 = \sum_{k=1}^{M} \left[ c_k (\cos(kx) + i \sin(kx)) + c_{-k} (\cos(kx) - i \sin(kx)) \right] + c_0. $$

Therefore we derive (3.9), thanks to the relations (3.12). 

Analogously, when $n$ is odd, (3.10) becomes 

$$ \tilde f(x) = \sum_{k=-(M+1)}^{M+1} c_k e^{ikx}, \tag{3.13} $$

where the coefficients $c_k$ for $k = 0, \ldots, M$ are the same as before, while 
$c_{M+1} = c_{-(M+1)} = a_{M+1}/2$. In both cases, we could write 

$$ \tilde f(x) = \sum_{k=-(M+\mu)}^{M+\mu} c_k e^{ikx}, \tag{3.14} $$

with $\mu = 0$ if $n$ is even and $\mu = 1$ if $n$ is odd. 

Because of its analogy with Fourier series, $\tilde f$ is called a (real) discrete 
Fourier transform (DFT). Imposing the interpolation condition at the 
nodes $x_j = jh$, with $h = 2\pi/(n + 1)$, we find that 

$$ \sum_{k=-(M+\mu)}^{M+\mu} c_k e^{ikjh} = f(x_j), \quad j = 0, \ldots, n. \tag{3.15} $$

For the computation of the coefficients $\{c_k\}$ let us multiply equations 
(3.15) by $e^{-imx_j} = e^{-imjh}$, where $m$ is an integer between 0 and $n$, and 
then sum with respect to $j$: 

$$ \sum_{j=0}^{n} \sum_{k=-(M+\mu)}^{M+\mu} c_k e^{ikjh} e^{-imjh} = \sum_{j=0}^{n} f(x_j) e^{-imjh}. \tag{3.16} $$

We now require the following identity: 

$$ \sum_{j=0}^{n} e^{ijh(k-m)} = (n + 1) \delta_{km}, $$

where $\delta_{km}$ is the Kronecker symbol. This identity is obviously true if 
$k = m$. When $k \ne m$, we have 

$$ \sum_{j=0}^{n} e^{ijh(k-m)} = \frac{1 - \left( e^{i(k-m)h} \right)^{n+1}}{1 - e^{i(k-m)h}}. $$

The numerator on the right hand side is null, since 

$$ 1 - e^{i(k-m)h(n+1)} = 1 - e^{i(k-m)2\pi} = 1 - \cos((k-m)2\pi) - i \sin((k-m)2\pi). $$

Therefore, from (3.16) we get the following explicit expression for the 
coefficients of $\tilde f$: 

$$ c_m = \frac{1}{n+1} \sum_{j=0}^{n} f(x_j) e^{-imjh}, \quad m = -(M + \mu), \ldots, M + \mu. $$

The computation of all the coefficients $c_m$ can be accomplished with 
an order of $n \log_2 n$ operations by using the fast Fourier transform (FFT), 
which is implemented in the MATLAB program fft (see Example 3.4). 
Similar conclusions hold for the inverse transform through which we 
obtain the values $\{f(x_j)\}$ from the coefficients $\{c_k\}$. The inverse fast 
Fourier transform is implemented in the MATLAB program ifft. 
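
The explicit sum for the coefficients and the output of fft can be compared directly. This is a minimal sketch, assuming N = n+1 = 8 nodes and a sample periodic function exp(cos(x)), neither of which is taken from the text; the index m runs here over 0, ..., N-1, and since the exponential in the sum is periodic in m with period N, the coefficient with index -m coincides with the one stored at N-m.

```matlab
% Coefficients c_m from the explicit sum, compared with fft
N = 8;                          % number of nodes n+1 (assumed)
h = 2*pi/N; j = 0:N-1;
y = exp(cos(j*h));              % samples of a 2*pi-periodic function (assumed)
c = zeros(1, N);
for m = 0:N-1
    % c_m = 1/(n+1) * sum_j f(x_j) exp(-i m j h)
    c(m+1) = sum(y .* exp(-1i*m*j*h)) / N;
end
Y = fft(y);
gap = max(abs(c - Y/N));        % difference between the two computations
yback = real(ifft(Y));          % the inverse transform recovers the node values
```

The loop costs of the order of N^2 operations, which is precisely what fft avoids.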

Example 3.4 Consider the function $f(x) = x(x - 2\pi)e^{-x}$ for $x \in [0, 2\pi]$. To 
use the MATLAB program fft, we evaluate $f$ at the nodes $x_j = j\pi/5$ for 
$j = 0, \ldots, 9$. Then, by the following instructions (and recalling that .* is the 
component-by-component vector product): 

>> x=pi/5*[0:9]; y=x.*(x-2*pi).*exp(-x); Y=fft(y); 

we compute 

Y = 
 Columns 1 through 3 
  -6.5203e+00             -4.6728e-01 + 4.2001e+00i   1.2681e+00 + 1.6211e+00i 
 Columns 4 through 6 
   1.0985e+00 + 6.0080e-01i   9.2585e-01 + 2.1398e-01i   8.7010e-01 - 1.3887e-16i 
 Columns 7 through 9 
   9.2585e-01 - 2.1398e-01i   1.0985e+00 - 6.0080e-01i   1.2681e+00 - 1.6211e+00i 
 Column 10 
  -4.6728e-01 - 4.2001e+00i 

The coefficients $\{c_k\}$ are simply given by Y/length(Y), where the command 
length computes the dimension of a vector (10 in this example). 

Note that the program ifft achieves the maximum efficiency when n is a 
power of 2, even though it works for any value of n. 

The command interpft provides the trigonometric interpolant of a 
set of data. It requires in input an integer $N$ and a vector of values 
which represent the values taken by a function (periodic with period $p$) 
at the set of points $x_i = ip/M$, $i = 1, \ldots, M - 1$. interpft returns 
the $N$ values of the trigonometric interpolant, obtained by the Fourier 
transform, at the nodes $t_i = ip/N$, $i = 0, \ldots, N - 1$. For instance, let us 
reconsider the function of Example 3.4 in $[0, 2\pi]$ and take its values at 10 
equispaced nodes $x_i = i\pi/5$, $i = 0, \ldots, 9$. The values of the trigonometric 
interpolant at, say, the 100 equispaced nodes $t_i = i\pi/50$, $i = 0, \ldots, 99$ 
can be obtained as follows (see Figure 3.7) 

>> x=pi/5*[0:9]; y=x.*(x-2*pi).*exp(-x); z=interpft(y,100); 




Fig. 3.7. The function $f(x) = x(x - 2\pi)e^{-x}$ (dashed line) and the 
corresponding trigonometric interpolant (continuous line) relative to 10 equispaced 
nodes 

In some cases the accuracy of trigonometric interpolation can degrade 
dramatically, as shown in the following example. 

Example 3.5 Let us approximate the function $f(x) = f_1(x) + f_2(x)$ where 
$f_1(x) = \sin(x)$ and $f_2(x) = \sin(5x)$, using nine equispaced nodes in the interval 
$[0, 2\pi]$. The result is shown in Figure 3.8. Note that in some intervals the 
trigonometric approximant shows even a phase inversion with respect to the 
function $f$. 

This lack of accuracy can be explained as follows. At the nodes considered, 
the function $f_2$ is indistinguishable from $f_3(x) = -\sin(3x)$ which 
has a lower frequency (see Figure 3.9). The function that is actually 
approximated is therefore $F(x) = f_1(x) + f_3(x)$ and not $f(x)$ (in fact, the 
dashed line of Figure 3.8 does coincide with $F$). 

Fig. 3.8. The effects of aliasing: comparison between the function 
$f(x) = \sin(x) + \sin(5x)$ (solid line) and its trigonometric interpolant (3.9) 
with $M = 3$ (dashed line) 



This phenomenon is known as aliasing and may occur when the func- 
tion to be approximated is the sum of several components having differ- 
ent frequencies. As soon as the number of nodes is not enough to resolve 
the highest frequencies, the latter may interfere with the low frequen- 
cies, giving rise to inaccurate interpolants. To get a better approximation 
for functions with higher frequencies, one has to increase the number of 
interpolation nodes. 
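
That sin(5x) and -sin(3x) take the same values at the nodes can be checked numerically. This is a minimal sketch, assuming that the nine equispaced nodes include both endpoints of the interval, so that their spacing is pi/4; the name gap is illustrative.

```matlab
% sin(5x) and -sin(3x) coincide at nine equispaced nodes in [0,2*pi]
x = linspace(0, 2*pi, 9);                 % nodes, both endpoints included (assumed)
gap = max(abs(sin(5*x) - (-sin(3*x))));   % largest difference at the nodes
xmid = pi/8;                              % a point between two nodes
gapmid = abs(sin(5*xmid) + sin(3*xmid));  % the two functions differ away from the nodes
```

The value of gap is at the level of rounding errors, because sin(5x) + sin(3x) = 2 sin(4x) cos(x) and sin(4x) vanishes at every multiple of pi/4.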

A real life example of aliasing is provided by the apparent inversion 
of the sense of rotation of spoked wheels. Once a certain critical velocity 
is reached the human brain is no longer able to accurately sample the 
moving image and, consequently, produces distorted images. 



Fig. 3.9. The phenomenon of aliasing: the functions $\sin(5x)$ (dashed line) and 
$-\sin(3x)$ (dotted line) take the same values at the interpolation nodes. This 
circumstance explains the severe loss of accuracy shown in Figure 3.8 

Let us summarize 



1. Approximating a set of data or a function $f$ in $[a, b]$ consists of 
finding a suitable function $\tilde f$ that represents them with enough 
accuracy; 

2. the interpolation process consists of determining a function $\tilde f$ such 
that $\tilde f(x_i) = y_i$, where the $\{x_i\}$ are given nodes and $\{y_i\}$ are either 
the values $\{f(x_i)\}$ or a set of prescribed values; 

3. if the $n + 1$ nodes $\{x_i\}$ are distinct, there exists a unique polynomial 
of degree less than or equal to $n$ interpolating a set of prescribed 
values $\{y_i\}$ at the nodes $\{x_i\}$; 

4. for an equispaced distribution of nodes in $[a, b]$ the interpolation 
error at any point of $[a, b]$ does not necessarily tend to 0 as $n$ tends 
to infinity. However, there exist special distributions of nodes, for 
instance the Chebyshev nodes, for which this convergence property 
holds true for all continuously differentiable functions; 

5. trigonometric interpolation is well suited to approximate periodic 
functions, and is based on choosing $\tilde f$ as a linear combination of sine 
and cosine functions. The FFT is a very efficient algorithm which 
allows the computation of the Fourier coefficients of a trigonometric 
interpolant from its node values and admits an equally fast inverse, 
the IFFT. 



3.2 Piecewise linear interpolation 



The Chebyshev interpolant provides an accurate approximation of smooth 
functions $f$ whose expression is known. In the case when $f$ is nonsmooth 
or when $f$ is only known by its values at a set of given points (which 
do not coincide with the Chebyshev nodes), one can resort to a different 
interpolation method which is called linear composite interpolation. 

More precisely, given a set of nodes $x_0 < x_1 < \ldots < x_n$, we denote 
by $I_i$ the interval $[x_i, x_{i+1}]$. We approximate $f$ by a continuous function 
which, on each interval, is given by the segment joining the two points 
$(x_i, f(x_i))$ and $(x_{i+1}, f(x_{i+1}))$ (see Figure 3.10). This function, denoted 
by $\Pi_1^H f$, is called a piecewise linear interpolation polynomial and its 
expression is: 

$$ \Pi_1^H f(x) = f(x_i) + \frac{f(x_{i+1}) - f(x_i)}{x_{i+1} - x_i} (x - x_i) \quad \text{for } x \in I_i. $$

Fig. 3.10. The function $f(x) = x^2 + 10/(\sin(x) + 1.2)$ (solid line) and its 
piecewise linear interpolation polynomial $\Pi_1^H f$ (dashed line) 

The upper-index $H$ denotes the maximum length of the intervals $I_i$. 

The following result can be inferred from (3.7) setting $n = 1$ and 
$h = H$: 

Proposition 3.3 If $f \in C^2(I)$, where $I = [x_0, x_n]$, then 

$$ \max_{x \in I} |f(x) - \Pi_1^H f(x)| \le \frac{H^2}{8} \max_{x \in I} |f''(x)|. $$

Consequently, for all $x$ in the interpolation interval, $\Pi_1^H f(x)$ tends to 
$f(x)$ when $H \to 0$, provided that $f$ is sufficiently smooth. 

Through the instruction s1=interp1(x,y,z) one can compute the 
values at arbitrary points, which are stored in the vector z, of the piecewise 
linear polynomial that interpolates the values y(i) at the nodes 
x(i), for i = 1,...,n+1. Note that z can have arbitrary dimension. If 
the nodes are in increasing order (i.e. x(i+1) > x(i), for i=1,...,n) 
then we can use the quicker version interp1q (q stands for quickly). 
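
The estimate of Proposition 3.3 can be observed on a simple case. This is a minimal sketch, assuming f(x) = sin(x) on [0, pi] with 11 equispaced nodes, so that the maximum of |f''| equals 1; the names err and bound are illustrative.

```matlab
% Piecewise linear interpolation of sin on [0,pi] and the bound H^2/8*max|f''|
x = linspace(0, pi, 11);       % 11 nodes (assumed), uniform spacing
y = sin(x);
z = linspace(0, pi, 1000);     % evaluation points
s1 = interp1(x, y, z);         % piecewise linear interpolant at z
H = max(diff(x));              % maximum length of the intervals
err = max(abs(sin(z) - s1));   % error measured on the grid z
bound = H^2/8;                 % right hand side of Proposition 3.3 with max|f''| = 1
```

Halving H is expected to reduce err by a factor of about four, in agreement with the quadratic dependence on H.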

It is worth mentioning that the command fplot, which is used to 
display the graph of a function $f$ on a given interval $[a, b]$, does indeed 
replace the function by its piecewise linear interpolant. The set of 
interpolating nodes is generated automatically from the function, following 
the criterion of clustering these nodes around points where $f$ shows 
strong variations. A procedure of this type is called adaptive. 



3.3 Approximation by spline functions 



The main drawback of piecewise linear interpolation is that $\Pi_1^H f$ is
nothing more than a globally continuous function. As a matter of fact, in
several applications, e.g. in computer graphics, it is desirable to get an
approximation by smooth functions which have at least a continuous derivative.

With this aim, we can construct a function $s_3$ with the following
properties:

1. on each interval $I_i = [x_i, x_{i+1}]$, for $i = 0, \ldots, n-1$, $s_3$ is a
polynomial of degree 3 which interpolates the pairs of values $(x_j, f(x_j))$
for $j = i, i+1$;

2. $s_3$ has continuous first and second derivatives on $[x_0, x_n]$.

For its complete determination, we need 4 conditions on each interval,
therefore a total of $4n$ equations, which we can provide as follows:
$n + 1$ conditions arise from the interpolation requirement at the nodes
$x_j$, $j = 0, \ldots, n$; requiring the continuity of the polynomial at the
internal nodes $x_1, \ldots, x_{n-1}$ yields $n - 1$ further equations; we obtain
$2(n - 1)$ new equations by requiring that both first and second derivatives
be continuous at the internal nodes. Finally, we still lack two further
equations, which we can e.g. choose as

\[
s_3''(x_0) = 0, \qquad s_3''(x_n) = 0. \tag{3.17}
\]

The function $s_3$ which we obtain in this way is called a natural
interpolating cubic spline. By suitably choosing the unknowns (see [QSS00],
Section 8.6.1) to represent $s_3$ we arrive at an $(n + 1) \times (n + 1)$ system
with a tridiagonal matrix whose solution can be accomplished by a number
of operations proportional to $n$ (see Section 5.4).

The choice (3.17) is not the only one possible to complete the system
of equations. Several alternatives exist. One possibility would be to
replace (3.17) by requiring the continuity of the third derivative of $s_3$ at
the nodes $x_1$ and $x_{n-1}$. This is precisely what we get when using the
MATLAB command spline (see also the toolbox splines). The input
parameters are the vectors x and y, as well as the vector z containing
the points at which we are seeking the values of $s_3$.

Example 3.6 Let us reconsider the data of Table 3.1 corresponding to the
column K = 0.67 and compute the associated cubic spline $s_3$. The different
values of the latitude provide the nodes $x_i$, $i = 0, \ldots, 12$. If we are
interested in computing the values $s_3(z_i)$, where $z_i = -55 + i$,
$i = 0, \ldots, 120$, we can proceed as follows:

>> x = [-55:10:65];
>> y = [-3.25 -3.37 -3.35 -3.2 -3.12 -3.02 -3.02 ...
        -3.07 -3.17 -3.32 -3.3 -3.22 -3.1];
>> z = [-55:1:65];
>> s = spline(x,y,z);

The graph of $s_3$, which is reported in Figure 3.11, looks more plausible than
that of the Lagrange interpolant at the same nodes.




Fig. 3.11. Comparison between the cubic spline and the Lagrange interpolant 
for the case considered in Example 3.6 

The error that we obtain in approximating a function $f$ (continuously
differentiable up to its fourth derivative) by the natural cubic spline
satisfies the following inequalities:

\[
\max_{x \in I} |f^{(r)}(x) - s_3^{(r)}(x)| \le C_r H^{4-r} \max_{x \in I} |f^{(4)}(x)|,
\qquad r = 0, 1, 2, 3,
\]

where $I = [x_0, x_n]$ and $H = \max_{i=0,\ldots,n-1}(x_{i+1} - x_i)$, while $C_r$ is a
suitable constant depending on $r$, but independent of $H$. It is then clear
that not only $f$, but also its first, second and third derivatives are well
approximated by $s_3$ when $H$ tends to 0.
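
The case $r = 0$ of this estimate can be observed experimentally. The sketch below is an illustration under explicit assumptions: it uses the MATLAB command spline, which imposes the continuity of the third derivative at $x_1$ and $x_{n-1}$ rather than the natural conditions (3.17), and the test function $\sin(x)$ on $[0, \pi]$ is an arbitrary choice. For a smooth function the observed order is still expected to be close to 4.

```matlab
% Observed convergence order of the cubic spline returned by spline.
% Assumptions: f(x) = sin(x) on [0, pi]; spline uses not-a-knot end
% conditions, not the natural conditions (3.17).
f = @(x) sin(x);
a = 0; b = pi;
z = linspace(a, b, 2001);        % fine grid used to measure the error
ns = [10 20 40];                 % numbers of intervals (H is halved each time)
err = zeros(size(ns));
for j = 1:numel(ns)
    x = linspace(a, b, ns(j)+1);
    s = spline(x, f(x), z);
    err(j) = max(abs(f(z) - s));
end
p = log2(err(1:end-1)./err(2:end));   % observed orders, expected near 4
disp(p)
```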

Remark 3.2 Cubic splines in general don't preserve monotonicity between
neighbouring nodes. For instance, by approximating the unit circle
in the first quadrant using the points $(x_k = \sin(k\pi/6),\ y_k = \cos(k\pi/6))$, for
$k = 0, \ldots, 3$, we would obtain an oscillatory spline (see Figure 3.12). In these
cases, other approximation techniques can be better suited. For instance, the
MATLAB command pchip provides the Hermite piecewise cubic interpolant
and guarantees the local monotonicity of the interpolant (see Figure 3.12).

The Hermite interpolant can be obtained by using the following instructions:

>> t = linspace(0,pi/2,4);
>> x = sin(t); y = cos(t);
>> xx = linspace(0,1,40);
>> plot(x,y,'s',xx,[pchip(x,y,xx);spline(x,y,xx)])




Fig. 3.12. Approximation of the first quarter of the circumference of the 
unitary circle using only 4 nodes. The dashed line is the cubic spline, whereas 
the continuous line is the piecewise cubic Hermite interpolant 



See the Exercises 3.5-3.8.



3.4 The least squares method 



We have already noticed that a Lagrange interpolation does not
guarantee a better approximation of a given function when the polynomial
degree gets large. This problem can be overcome by composite
interpolation (such as piecewise linear polynomials or splines). However,
neither is suitable to extrapolate information from the available data, that
is, to generate new values at points lying outside the interval where
interpolation nodes are given.

Example 3.7 On the basis of the data reported in Figure 3.1, we would like
to predict whether the stock price will increase or diminish in the coming
days. The Lagrange polynomial interpolation is impractical, as it would
require a (tremendously oscillatory) polynomial of degree 719 which will
provide a completely useless prediction. On the other hand, piecewise linear
interpolation, whose graph is reported in Figure 3.1, provides extrapolated
results by exploiting only the values of the last two days, thus completely
neglecting the previous history. To get a better result we should avoid the
interpolation requirement, by invoking least-squares approximation as
indicated below.

Assume that the data $\{(x_i, f(x_i)),\ i = 0, \ldots, n\}$ are available. We look
for a polynomial $\tilde f$ of degree at most $m \ge 1$ which satisfies the following
inequality:

\[
\sum_{i=0}^{n} [f(x_i) - \tilde f(x_i)]^2 \le \sum_{i=0}^{n} [f(x_i) - p_m(x_i)]^2 \tag{3.18}
\]

for every polynomial $p_m$ of degree at most $m$. Should it exist, $\tilde f$ will
be called the least-squares approximation of $f$. For arbitrary values of $m$
and $n$, it will not be possible to guarantee that $\tilde f(x_i) = f(x_i)$ for all
$i = 0, \ldots, n$.

Setting

\[
\tilde f(x) = a_0 + a_1 x + \ldots + a_m x^m, \tag{3.19}
\]

where the coefficients $a_0, \ldots, a_m$ are unknown, the problem (3.18) can
be restated as follows:

\[
\Phi(a_0, a_1, \ldots, a_m) = \min_{\{b_i,\ i = 0, \ldots, m\}} \Phi(b_0, b_1, \ldots, b_m),
\]

where

\[
\Phi(b_0, b_1, \ldots, b_m) = \sum_{i=0}^{n} \left[ f(x_i) - (b_0 + b_1 x_i + \ldots + b_m x_i^m) \right]^2.
\]

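Let us solve this problem in the special case when $m = 1$. Since

\[
\Phi(b_0, b_1) = \sum_{i=0}^{n} \left[ f(x_i)^2 + b_0^2 + b_1^2 x_i^2 + 2 b_0 b_1 x_i
- 2 b_0 f(x_i) - 2 b_1 x_i f(x_i) \right],
\]

the graph of $\Phi$ is a convex paraboloid. The point $(a_0, a_1)$ at which $\Phi$
attains its minimum satisfies the conditions:

\[
\frac{\partial \Phi}{\partial b_0}(a_0, a_1) = 0, \qquad
\frac{\partial \Phi}{\partial b_1}(a_0, a_1) = 0,
\]

where the symbol $\partial \Phi / \partial b_j$ denotes the partial derivative (that is,
the rate of variation) of $\Phi$ with respect to $b_j$, after having frozen the
remaining variable (see the definition 8.3).

By explicitly computing the two partial derivatives we obtain

\[
\sum_{i=0}^{n} [a_0 + a_1 x_i - f(x_i)] = 0, \qquad
\sum_{i=0}^{n} [a_0 x_i + a_1 x_i^2 - x_i f(x_i)] = 0,
\]

which is a system of 2 equations for the 2 unknowns $a_0$ and $a_1$:

\[
\begin{array}{l}
a_0 (n + 1) + a_1 \displaystyle\sum_{i=0}^{n} x_i = \sum_{i=0}^{n} f(x_i), \\[6pt]
a_0 \displaystyle\sum_{i=0}^{n} x_i + a_1 \sum_{i=0}^{n} x_i^2 = \sum_{i=0}^{n} f(x_i) x_i.
\end{array}
\tag{3.20}
\]

Setting $D = (n + 1) \sum_{i=0}^{n} x_i^2 - \left( \sum_{i=0}^{n} x_i \right)^2$, the solution reads:

\[
\begin{array}{l}
a_0 = \dfrac{1}{D} \left[ \displaystyle\sum_{i=0}^{n} f(x_i) \sum_{j=0}^{n} x_j^2
- \sum_{j=0}^{n} x_j \sum_{i=0}^{n} x_i f(x_i) \right], \\[10pt]
a_1 = \dfrac{1}{D} \left[ (n + 1) \displaystyle\sum_{j=0}^{n} x_j f(x_j)
- \sum_{j=0}^{n} x_j \sum_{i=0}^{n} f(x_i) \right].
\end{array}
\tag{3.21}
\]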



The corresponding polynomial $\tilde f$ is known as the least-squares straight
line, or regression line.
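
The formulae (3.21) translate directly into a few vector operations. The following sketch uses made-up data (the vectors x and y are hypothetical and serve only as an illustration) and compares the result with the coefficients returned by polyfit with degree 1.

```matlab
% Regression line through the closed-form solution (3.21).
% The data below are hypothetical and used only for illustration.
x = [0 1 2 3 4];
y = [1.1 1.9 3.2 3.8 5.1];
n = length(x) - 1;                               % the data are indexed 0..n
D  = (n+1)*sum(x.^2) - sum(x)^2;
a0 = (sum(y)*sum(x.^2) - sum(x)*sum(x.*y))/D;    % intercept
a1 = ((n+1)*sum(x.*y) - sum(x)*sum(y))/D;        % slope
c = polyfit(x, y, 1);                            % c(1) is the slope, c(2) the intercept
fprintf('a0 = %g, a1 = %g, polyfit: %g, %g\n', a0, a1, c(2), c(1));
```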

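The previous approach can be generalized to arbitrary $m$. The associated
$(m + 1) \times (m + 1)$ linear system, which is symmetric, will have the
following form:

\[
\begin{array}{l}
a_0 (n + 1) + a_1 \displaystyle\sum_{i=0}^{n} x_i + \ldots + a_m \sum_{i=0}^{n} x_i^m
= \sum_{i=0}^{n} f(x_i), \\[6pt]
a_0 \displaystyle\sum_{i=0}^{n} x_i + a_1 \sum_{i=0}^{n} x_i^2 + \ldots + a_m \sum_{i=0}^{n} x_i^{m+1}
= \sum_{i=0}^{n} x_i f(x_i), \\[6pt]
\qquad \vdots \\[6pt]
a_0 \displaystyle\sum_{i=0}^{n} x_i^m + a_1 \sum_{i=0}^{n} x_i^{m+1} + \ldots + a_m \sum_{i=0}^{n} x_i^{2m}
= \sum_{i=0}^{n} x_i^m f(x_i).
\end{array}
\]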



When $m = n$, the least-squares polynomial must coincide with the
Lagrange interpolating polynomial $\Pi_n f$ (see Exercise 3.9).

The MATLAB command c=polyfit(x,y,m) computes by default the
coefficients of the polynomial of degree m which approximates n+1 pairs of
data (x(i),y(i)) in the least-squares sense. As already noticed in Section
3.1.1, when m is equal to n it returns the interpolating polynomial.
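
As an illustration, the symmetric system above can be assembled and solved explicitly for a small degree and compared with polyfit. This is only a sketch: the data are hypothetical, and solving the normal equations with backslash is shown for its correspondence with the formulae, not as a claim about how polyfit works internally.

```matlab
% Least-squares polynomial of degree m through the symmetric system
% of normal equations. The data below are hypothetical.
x = [0 0.5 1 1.5 2 2.5];
y = [1.0 1.8 3.1 5.2 7.9 11.1];
m = 2;
A = zeros(m+1);  r = zeros(m+1, 1);
for j = 0:m
    for k = 0:m
        A(j+1, k+1) = sum(x.^(j+k));   % entry: sum over i of x_i^(j+k)
    end
    r(j+1) = sum(x.^j .* y);           % right-hand side: sum of x_i^j f(x_i)
end
a = A\r;                % a(1) = a_0, ..., a(m+1) = a_m
c = polyfit(x, y, m);   % polyfit orders coefficients from x^m down to x^0
disp([flipud(a).'; c])
```

Note that polyfit returns the coefficients in decreasing order of the powers of x, hence the call to flipud in the comparison.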
Example 3.8 In Figure 3.13 we draw the graphs of the least-squares
polynomials of degree 1, 2 and 4 that approximate the data of Figure 3.1. The
polynomial of degree 4 reproduces quite reasonably the behavior of the stock
price in the considered time interval and suggests that in the near future the
quotation will increase.




Fig. 3.13. Least squares approximation of the data of Problem 3.2 of degree 
1 (dashed-dotted line), degree 2 (dashed line) and degree 4 (thick solid line). 
The exact data are represented by the thin solid line 



Let us summarize 



1. The composite piecewise linear interpolant of a function $f$ is a
piecewise linear continuous function $\tilde f$, which interpolates $f$ at a
given set of nodes $\{x_i\}$. With this approximation we avoid Runge's
type phenomena when the number of nodes increases;

2. interpolation by cubic splines allows the approximation of $f$ by a
piecewise cubic function $\tilde f$ which is continuous together with its
first and second derivatives;

3. in least-squares approximation we look for an approximant $\tilde f$ which
is a polynomial of degree $m$ (typically, $m < n$) that minimizes the
mean-square error $\sum_{i=0}^{n} [f(x_i) - \tilde f(x_i)]^2$.



See the Exercises 3.9-3.13.



3.5 What we haven't told you



For a more general introduction to the theory of interpolation and
approximation the reader is referred to [Dav63], [Mei67] and [Gau97].

Polynomial interpolation can also be used to approximate data and
functions in several dimensions. In particular, composite interpolation,
based on piecewise linear or spline functions, is well suited when the
region $\Omega$ at hand is partitioned into polygons in 2D (triangles or
quadrilaterals) and polyhedra in 3D (tetrahedra or prisms).

A special situation occurs when $\Omega$ is a rectangle or a parallelepiped. In
that case one could simply use the commands zi=interp2(x,y,z,xi,yi)
or vi=interp3(x,y,z,v,xi,yi,zi), respectively, z, v being
the interpolating values and xi, ..., zi the nodes at which the
interpolating polynomial should be evaluated.

For instance, to approximate by a cubic spline the values of the
function $f(x, y) = \sin(2\pi x) \cos(2\pi y)$ on a uniform grid of 36 nodes on the
square $[0, 1]^2$, we use the following instructions:

>> [x,y]=meshgrid(0:0.2:1,0:0.2:1); z=sin(2*pi*x).*cos(2*pi*y);

Then, the cubic interpolating spline, evaluated on a uniform grid of
441 nodes (21 in both x and y directions), can be obtained as follows:

>> xi = [0:0.05:1]; yi=[0:0.05:1];
>> [xf,yf]=meshgrid(xi,yi); pi3=interp2(x,y,z,xf,yf,'spline');



The command meshgrid transforms the domain specified by the vectors
xi and yi into arrays xf and yf that can be used for the evaluation of a
function of two variables and for 3-dimensional MATLAB surface plots.
The rows of the matrix xf are copies of the vector xi and the columns of
the matrix yf are copies of the vector yi. Alternatively, one can use the
function griddata (also available for 3-dimensional data (griddata3) or
for n-dimensional hypersurface fitting (griddatan)).

When $\Omega$ is a two-dimensional domain of arbitrary shape, it can be
partitioned into triangles using the graphical interface pdetool.

For a general presentation of spline functions see, e.g., [Die93] and
[PBP02]. The MATLAB toolbox splines allows one to explore several
applications of spline functions. In particular, the spdemos command
gives the user the possibility to investigate the properties of the most
important types of spline functions. Rational splines, i.e. functions which
are the ratio of two spline functions, are accessible through the
commands rpmak and rsmak. Special instances are the so-called NURBS
splines, which are commonly used in CAGD (Computer Assisted Geometric
Design).

In the same context of Fourier approximation, we mention the
approximation based on wavelets. This type of approximation is largely used
for image reconstruction and compression and in signal analysis (for an
introduction, see [DL92], [Urb02]). A rich family of wavelets (and their
applications) can be found in the MATLAB toolbox wavelet.



3.6 Exercises 



Exercise 3.1 Prove inequality (3.6). 

Exercise 3.2 Provide an upper bound of the Lagrange interpolation error for
the following functions:

$f_1(x) = \cosh(x)$, $f_2(x) = \sinh(x)$, $x_k = -1 + 0.5k$, $k = 0, \ldots, 4$,
$f_3(x) = \cos(x) + \sin(x)$, $x_k = -\pi/2 + \pi k/4$, $k = 0, \ldots, 4$.

Exercise 3.3 The following data are related to the life expectation of citizens
of two European regions:

                    1975    1980    1985    1990
Western Europe      72.8    74.2    75.2    76.4
Eastern Europe      70.2    70.2    70.3    71.2

Use the interpolating polynomial of degree 3 to estimate the life expectation in
1970, 1983 and 1988. Then extrapolate a value for the year 1995. It is known
that the life expectation in 1970 was 71.8 years for the citizens of Western
Europe, and 69.6 for those of Eastern Europe. Recalling these data, is it
possible to estimate the accuracy of the life expectation predicted for 1995?



Exercise 3.4 The price (in euros) of a magazine has changed as follows:

Nov. 87  Dec. 88  Nov. 90  Jan. 93  Jan. 95  Jan. 96  Nov. 96  Nov. 00
  4.5      5.0      6.0      6.5      7.0      7.5      8.0      8.0

Estimate the price in November 2002 by extrapolating these data.



Exercise 3.5 Repeat the computations carried out in Exercise 3.3, using now 
the cubic interpolating spline computed by the function spline. Then compare 
the results obtained with the two approaches. 

Exercise 3.6 In the table below we report the values of the sea water density
$\rho$ (in kg/m^3) corresponding to different values of the temperature $T$ (in
degrees Celsius):

T        4°          8°          12°         16°         20°
ρ     1000.7794   1000.6427   1000.2805   999.7165    998.9700

Compute the associated cubic interpolating spline on 4 subintervals of the
temperature interval [4, 20]. Then compare the results provided by the spline
interpolant with the following ones (which correspond to further values of T):

T        6°           10°         14°         18°
ρ     1000.74088   1000.4882   1000.0224   999.3650

Exercise 3.7 The Italian production of citrus fruit has changed as follows:

year                     1965    1970    1980    1985    1990    1991
production (x10^5 kg)   17769   24001   25961   34336   29036   33417

Use a cubic interpolating spline to estimate the production in 1962, 1977 and
1992. Compare these results with the real values: 12380, 27403 and 32059,
respectively. Repeat the computation using the Lagrange interpolation
polynomial.

Exercise 3.8 Evaluate the function $f(x) = \sin(2\pi x)$ at 21 equispaced nodes
in the interval $[-1, 1]$. Compute the Lagrange interpolating polynomial and
the cubic interpolating spline. Compare the graphs of these two functions
with that of $f$ on the given interval. Repeat the same calculation using the
following perturbed set of data: $f(x_i) = (-1)^{i+1} 10^{-4}$, and observe that the
Lagrange interpolating polynomial is more sensitive to small perturbations
than the cubic spline.

Exercise 3.9 Verify that if $m = n$ the least-squares polynomial of a function
$f$ at the nodes $x_0, \ldots, x_n$ coincides with the interpolating polynomial
$\Pi_n f$ at the same nodes.

Exercise 3.10 Compute the least-squares polynomial of degree 4 that ap- 
proximates the data of the second, third and fourth columns of Table 3.1. 

Exercise 3.11 Repeat the computations carried out in Exercise 3.7 using 
now a least-squares approximation of degree 3. 

Exercise 3.12 Express the coefficients of system (3.20) in terms of the
average $M = \frac{1}{n+1} \sum_{i=0}^{n} x_i$ and the variance
$v = \frac{1}{n+1} \sum_{i=0}^{n} (x_i - M)^2$ of the set
of data $\{x_i,\ i = 0, \ldots, n\}$.

Exercise 3.13 Verify that the regression line passes through the point whose
abscissa is the average of $\{x_i\}$ and ordinate is the average of $\{f(x_i)\}$.




4. Numerical differentiation and 
integration 



In this chapter we propose methods for the numerical approximation of
derivatives and integrals of functions. Concerning integration, it is known
that for a generic function it is not always possible to find a primitive in
an explicit form. In other cases, it could be hard to evaluate a primitive
as, for instance, in the example

\[
\int_0^{\pi} \cos(4x) \cos(3 \sin(x))\,dx
= \pi \left( \frac{3}{2} \right)^4 \sum_{k=0}^{\infty} \frac{(-9/4)^k}{k!\,(k + 4)!},
\]

where the problem of computing an integral is transformed into the
equally troublesome one of summing a series. It is also worth mentioning
that sometimes the function that we want to integrate or differentiate
could only be known on a set of nodes (for instance, when the latter
represent the results of an experimental measurement), exactly as happens
in the case of the approximation of functions, which was discussed in
Chapter 3.
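
The identity between the integral and the series can be checked numerically. The sketch below is only an illustration: it relies on the built-in adaptive quadrature function integral, which is not one of the formulae introduced in this chapter, and on a truncation of the series after 21 terms, which is an arbitrary choice (the terms decay very quickly).

```matlab
% Numerical check: integral of cos(4x)cos(3sin(x)) on [0, pi] versus
% the truncated series. Using integral and 21 terms are assumptions
% made for this illustration only.
g = @(x) cos(4*x).*cos(3*sin(x));
Iq = integral(g, 0, pi);                 % adaptive quadrature
k = 0:20;                                % indices of the retained terms
Is = pi*(3/2)^4*sum((-9/4).^k ./ (factorial(k).*factorial(k+4)));
fprintf('quadrature: %.10f   series: %.10f\n', Iq, Is);
```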

In all these situations it is necessary to consider numerical methods
in order to obtain an approximate value of the quantity of interest,
regardless of how difficult the function is to integrate or differentiate.



Problem 4.1 (Hydraulics) The height $q(t)$ (in meters) of a flow in a conical
funnel with a circular hole of radius $r = 10$ cm is measured at different time
intervals, yielding the following values:

t       0      5      10       15       20
q(t)    ...    ...    0.7366   0.5430   ...

We want to compute an approximation of the emptying velocity $q'(t)$ and
compare it with the analytical one, given by
$q'(t) = -0.6 \pi r^2 \sqrt{19.6\, q(t)} / A(t)$
($A(t)$ is the area of the horizontal section of the cone corresponding to a fluid
height $q(t)$). For the solution of this problem, see Example 4.1. •

Problem 4.2 (Optics) In order to plan a room for infrared beams we are
interested in calculating the energy emitted by a black body (that is, an object
capable of radiating over the whole spectrum at ambient temperature) in
the (infrared) spectrum comprised between 3 μm and 14 μm wavelength. The
solution of this problem is obtained by computing the integral

\[
E(T) = 2.39 \cdot 10^{-11} \int_{3 \cdot 10^{-4}}^{14 \cdot 10^{-4}}
\frac{dx}{x^5 \left( e^{1.432/(Tx)} - 1 \right)}, \tag{4.1}
\]

which is the Planck equation for the energy $E(T)$, where $x$ is the wavelength (in
cm) and $T$ the temperature (in Kelvin) of the black body. For the computation
of the above integral, see Exercise 4.17. •

Problem 4.3 (Electromagnetism) Consider an electric wire sphere of
arbitrary radius $r$ and conductivity $\sigma$. We want to compute the density
distribution of the current $\mathbf{j}$ as a function of $r$ and $t$ (the time),
knowing the initial distribution of the charge density $\rho(r)$. The problem can
be solved using the relations between the current density, the electric field
and the charge density and observing that, for the symmetry of the problem,
$\mathbf{j}(\mathbf{r}, t) = j(r, t)\, \mathbf{r}/|\mathbf{r}|$,
where $j = |\mathbf{j}|$. We obtain

\[
j(r, t) = \gamma(r)\, e^{-\sigma t/\varepsilon_0}, \qquad
\gamma(r) = \frac{\sigma}{\varepsilon_0 r^2} \int_0^r \rho(\xi)\, \xi^2\, d\xi, \tag{4.2}
\]

where $\varepsilon_0 = 8.859 \cdot 10^{-12}$ farad/m is the dielectric constant of the vacuum.

For the computation of $\gamma(r)$, see Exercise 4.16. •



4.1 Approximation of function derivatives 



Consider a function $f : [a, b] \to \mathbb{R}$ continuously differentiable in $[a, b]$.
We seek an approximation of the first derivative of $f$ at a generic point
$x$ in $(a, b)$.

In view of the definition (1.9), for $h$ sufficiently small and positive, we
can assume that the quantity

\[
(\delta_+ f)(x) = \frac{f(x + h) - f(x)}{h} \tag{4.3}
\]

is an approximation of $f'(x)$ which is called the forward finite difference.
To estimate the error, it suffices to expand $f$ in a Taylor series, obtaining

\[
f(x + h) = f(x) + h f'(x) + \frac{h^2}{2} f''(\xi) \tag{4.4}
\]

where $\xi$ is a suitable point in the interval $(x, x + h)$. Therefore

\[
(\delta_+ f)(x) = f'(x) + \frac{h}{2} f''(\xi), \tag{4.5}
\]

and thus $(\delta_+ f)(x)$ provides a first-order approximation to $f'(x)$ with
respect to $h$. With a similar procedure, we can derive from the Taylor
expansion

\[
f(x - h) = f(x) - h f'(x) + \frac{h^2}{2} f''(\eta) \tag{4.6}
\]

with $\eta \in (x - h, x)$, the backward finite difference

\[
(\delta_- f)(x) = \frac{f(x) - f(x - h)}{h} \tag{4.7}
\]

which is also first-order accurate. Note that formulae (4.3) and (4.7) can
also be obtained by differentiating the linear polynomial interpolating $f$
at the points $\{x, x + h\}$ and $\{x - h, x\}$, respectively. In fact, from the
geometrical viewpoint, these schemes amount to approximating $f'(x)$ by
the slope of the straight line passing through the two points $(x, f(x))$
and $(x + h, f(x + h))$, or $(x - h, f(x - h))$ and $(x, f(x))$, respectively (see
Figure 4.1).

Finally, we introduce the centered finite difference formula

\[
(\delta f)(x) = \frac{f(x + h) - f(x - h)}{2h} \tag{4.8}
\]

which provides a second-order approximation to $f'(x)$ with respect to $h$.
Indeed, we obtain

\[
f'(x) - (\delta f)(x) = -\frac{h^2}{12} \left[ f'''(\xi) + f'''(\eta) \right], \tag{4.9}
\]

where $\eta$ and $\xi$ are suitable points in the intervals $(x - h, x)$ and $(x, x + h)$,
respectively (see Exercise 4.2).

By (4.8) $f'(x)$ is approximated by the slope of the straight line passing
through the points $(x - h, f(x - h))$ and $(x + h, f(x + h))$.

Fig. 4.1. Finite difference approximation of $f'(x)$: backward (solid line),
forward (dotted line) and centered (dashed line). $m_1$, $m_2$ and $m_3$ denote the
slopes of the three straight lines
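
The different orders of accuracy of (4.3), (4.7) and (4.8) can be observed with a short experiment. The following is a sketch under assumptions chosen only for illustration: the function $e^x$, the point $x = 1$ and the values of $h$ do not come from the text.

```matlab
% Observed order of the forward, backward and centered finite differences.
% Assumptions: f(x) = exp(x), x0 = 1, h = 1e-1, ..., 1e-4.
f = @(x) exp(x);
x0 = 1;  exact = exp(1);               % f'(x) = exp(x)
h = 10.^(-1:-1:-4);
fwd = (f(x0+h) - f(x0))./h;            % formula (4.3)
bwd = (f(x0) - f(x0-h))./h;            % formula (4.7)
ctr = (f(x0+h) - f(x0-h))./(2*h);      % formula (4.8)
err = [abs(fwd - exact); abs(bwd - exact); abs(ctr - exact)];
% h is divided by 10 at each step, so log10 of the ratio of two
% consecutive errors estimates the order of accuracy
p = log10(err(:,1:end-1)./err(:,2:end));
disp(p)    % rows: forward, backward, centered
```

The first two rows of p should be close to 1 and the third close to 2. For much smaller values of h rounding errors would eventually dominate.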



Example 4.1 Let us solve Problem 4.1, using formulae (4.3), (4.7) and (4.8)
to approximate $q'(t)$. We obtain:

t           0.0        5         10        15        20
q'(t)     -0.0220   -0.0259   -0.0326   -0.0473   -0.1504
δ+ q      -0.0238   -0.0289   -0.0387   -0.0746      -
δ- q         -      -0.0238   -0.0289   -0.0387   -0.0746
δ q          -      -0.0263   -0.0338   -0.0567      -

The agreement between the exact derivative and the one computed from the
finite difference formulae with $h = 5$ is more satisfactory when using formula
(4.8) rather than (4.7) or (4.3).



In general, we can assume that the values of $f$ are available at $n + 1$
equispaced points $x_i = x_0 + ih$, $i = 0, \ldots, n$, with $h > 0$. In this case,
$f'(x_i)$ can be approximated numerically by taking one of the previous
formulae (4.3), (4.7) or (4.8) with $x = x_i$.

Note that the centered formula (4.8) can be used only at the internal
points $x_1, \ldots, x_{n-1}$ but not at the endpoints $x_0$ and $x_n$. At the latter
nodes we could use the values

\[
\begin{array}{ll}
\dfrac{1}{2h} \left[ -3 f(x_0) + 4 f(x_1) - f(x_2) \right] & \text{at } x_0, \\[8pt]
\dfrac{1}{2h} \left[ 3 f(x_n) - 4 f(x_{n-1}) + f(x_{n-2}) \right] & \text{at } x_n,
\end{array}
\tag{4.10}
\]

which are also second-order accurate with respect to $h$. They are
obtained by computing at the point $x_0$ (respectively, $x_n$) the first
derivative of the polynomial of degree 2 interpolating $f$ at the nodes
$x_0, x_1, x_2$ (respectively, $x_{n-2}, x_{n-1}, x_n$).
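
A possible way to combine (4.8) and (4.10) on a set of equispaced nodes is sketched below. The test function, a polynomial of degree 2 chosen here only as an illustration, is differentiated exactly (up to rounding errors) by all three formulae, since each of them is the derivative of an interpolating polynomial of degree 2.

```matlab
% Second-order approximation of f' at all the nodes: centered formula
% (4.8) at the internal nodes, one-sided formulae (4.10) at the endpoints.
% The function f and the nodes are illustrative assumptions.
f = @(x) x.^2 - 3*x;
x = linspace(0, 1, 11);   h = x(2) - x(1);
y = f(x);
n = length(x);
d = zeros(1, n);
d(2:n-1) = (y(3:n) - y(1:n-2))/(2*h);            % (4.8) at internal nodes
d(1) = (-3*y(1) + 4*y(2) - y(3))/(2*h);          % (4.10) at the first node
d(n) = ( 3*y(n) - 4*y(n-1) + y(n-2))/(2*h);      % (4.10) at the last node
err = max(abs(d - (2*x - 3)));                   % exact derivative is 2x - 3
disp(err)
```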



See the Exercises 4.1-4.4.



4.2 Numerical integration



In this section we introduce numerical methods suitable for approximating
the integral

\[
I(f) = \int_a^b f(x)\,dx,
\]

where $f$ is an arbitrary continuous function in $[a, b]$. We start by
introducing some simple formulae, which are indeed special instances of the
broader family of Newton-Cotes formulae.



4.2.1 Midpoint formula 



A simple procedure to approximate $I(f)$ can be devised by partitioning
the interval $[a, b]$ into subintervals $I_k = [x_{k-1}, x_k]$, $k = 1, \ldots, M$, with
$x_k = a + kH$, $k = 0, \ldots, M$ and $H = (b - a)/M$. Since

\[
I(f) = \sum_{k=1}^{M} \int_{I_k} f(x)\,dx, \tag{4.11}
\]

on each sub-interval $I_k$ we can approximate the exact integral of $f$ by
that of a polynomial $\tilde f$ approximating $f$ on $I_k$. The simplest solution
consists of choosing $\tilde f$ as the constant polynomial interpolating $f$ at the
middle point of $I_k$:

\[
\bar x_k = \frac{x_{k-1} + x_k}{2}.
\]

In such a way we obtain the composite midpoint quadrature formula

\[
I_{mp}^c(f) = H \sum_{k=1}^{M} f(\bar x_k). \tag{4.12}
\]

The symbol mp stands for midpoint, while c stands for composite. This
formula is second-order accurate with respect to $H$. More precisely, if $f$
is continuously differentiable up to its second derivative, we have

\[
I(f) - I_{mp}^c(f) = \frac{b - a}{24} H^2 f''(\xi), \tag{4.13}
\]

where $\xi$ is a suitable point in $(a, b)$ (see Exercise 4.6). Formula (4.12)
is also called the composite rectangle quadrature formula because of its
geometrical interpretation, which is evident from Figure 4.2.




Fig. 4.2. The composite midpoint formula (left); the midpoint formula (right) 

The classical midpoint formula (or rectangle formula) is obtained by
taking $M = 1$ in (4.12), i.e. using the midpoint rule directly on the
interval $(a, b)$:

\[
I_{mp}(f) = (b - a)\, f\!\left( \frac{a + b}{2} \right).
\]

The error is now given by

\[
I(f) - I_{mp}(f) = \frac{(b - a)^3}{24} f''(\xi), \tag{4.14}
\]

where $\xi$ is a suitable point in $(a, b)$. Relation (4.14) follows as a special
case of (4.13), but it can also be proved directly. Indeed, setting
$\bar x = (a + b)/2$, we have

\[
I(f) - I_{mp}(f) = \int_a^b [f(x) - f(\bar x)]\,dx
= \int_a^b f'(\bar x)(x - \bar x)\,dx + \frac{1}{2} \int_a^b f''(\eta(x))(x - \bar x)^2\,dx,
\]

where $\eta(x)$ is a suitable point in the interval whose endpoints are $x$ and
$\bar x$. Then (4.14) follows because $\int_a^b (x - \bar x)\,dx = 0$ and, by the mean value
theorem for integrals, there exists $\xi \in (a, b)$ such that

\[
\frac{1}{2} \int_a^b f''(\eta(x))(x - \bar x)^2\,dx
= \frac{1}{2} f''(\xi) \int_a^b (x - \bar x)^2\,dx
= \frac{(b - a)^3}{24} f''(\xi).
\]

The degree of exactness of a quadrature formula is the maximum integer
$r \ge 0$ for which the approximate integral (produced by the quadrature
formula) of any polynomial of degree $r$ is equal to the exact integral.
Thus, the midpoint formula has degree of exactness 1, since it integrates
exactly all polynomials of degree less than or equal to 1 (but not all
those of degree 2).

The midpoint composite quadrature formula is implemented in
Program 3. Input parameters are the endpoints of the integration interval a
and b, the number of subintervals M and a string f to define the function
$f$. In particular, in order to ensure a correct execution of the program
if $f$ is a constant function, f must be provided in the form 'c+0*x'
where c is the constant value of $f$.



Program 3 - midpointc: composite midpoint quadrature formula

function Imp=midpointc(a,b,M,f)
%MIDPOINTC Composite midpoint numerical integration.
% IMP = MIDPOINTC(A,B,M,FUN) computes an approximation of the integral
% of the function FUN via the midpoint method (with M equispaced intervals).
% FUN accepts a real vector input x and returns a real vector of the same size.
% FUN can also be an inline object.
H=(b-a)/M;
x = linspace(a+H/2,b-H/2,M);
fmp=feval(f,x);
Imp=H*sum(fmp);
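
The degree of exactness and the error formula (4.13) can be verified on two simple polynomials. The sketch below repeats the few instructions of the composite midpoint formula inline, so that it can be run on its own; the integrands, the interval and the value of M are illustrative assumptions.

```matlab
% Composite midpoint formula on [0, 2] with M = 4: exact for a polynomial
% of degree 1, error given by (4.13) for x^2 (whose second derivative is 2).
a = 0; b = 2; M = 4; H = (b-a)/M;
x = linspace(a+H/2, b-H/2, M);     % midpoints of the M subintervals
I1 = H*sum(3*x + 1);               % exact integral of 3x+1 on [0,2] is 8
I2 = H*sum(x.^2);                  % exact integral of x^2 on [0,2] is 8/3
err2 = 8/3 - I2;                   % observed error I(f) - Imp(f)
pred = (b-a)/24*H^2*2;             % right-hand side of (4.13) with f'' = 2
fprintf('I1 = %g, err2 = %g, predicted = %g\n', I1, err2, pred);
```

Since the second derivative of $x^2$ is constant, the predicted and the observed errors coincide up to rounding.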



See the Exercises 4.5-4.8.



4.2.2 Trapezoidal formula 



Another formula can be obtained by replacing $f$ on $I_k$ by the linear
polynomial interpolating $f$ at the nodes $x_{k-1}$ and $x_k$ (equivalently, replacing
$f$ by $\Pi_1^H f$, see Section 3.2, on the whole interval $(a, b)$). This yields

\[
I_t^c(f) = \frac{H}{2} \sum_{k=1}^{M} [f(x_k) + f(x_{k-1})]
= \frac{H}{2} [f(a) + f(b)] + H \sum_{k=1}^{M-1} f(x_k). \tag{4.15}
\]

This formula is called the composite trapezoidal formula, and is
second-order accurate with respect to $H$.
In fact, one can obtain the expression

\[
I(f) - I_t^c(f) = -\frac{b - a}{12} H^2 f''(\xi) \tag{4.16}
\]

for the quadrature error for a suitable point $\xi \in (a, b)$, provided that
$f \in C^2([a, b])$. When (4.15) is used with $M = 1$, we obtain

\[
I_t(f) = \frac{b - a}{2} [f(a) + f(b)] \tag{4.17}
\]

which is called the trapezoidal formula because of its geometrical
interpretation. The error induced is given by

\[
I(f) - I_t(f) = -\frac{(b - a)^3}{12} f''(\xi), \tag{4.18}
\]

where $\xi$ is a suitable point in $(a, b)$, from which we can deduce that (4.17)
has degree of exactness equal to 1, as is the case of the midpoint rule.

With a simple modification of this procedure we can obtain a more
accurate formula. In fact, we can still approximate $f$ by a polynomial,
but this time we consider as interpolatory points the Gauss nodes

\[
\begin{array}{l}
\gamma_{k-1} = \dfrac{x_{k-1} + x_k}{2} - \dfrac{1}{\sqrt{3}} \left( \dfrac{x_k - x_{k-1}}{2} \right)
= x_{k-1} + \left( 1 - \dfrac{1}{\sqrt{3}} \right) \dfrac{H}{2}, \\[10pt]
\gamma_k = \dfrac{x_{k-1} + x_k}{2} + \dfrac{1}{\sqrt{3}} \left( \dfrac{x_k - x_{k-1}}{2} \right)
= x_{k-1} + \left( 1 + \dfrac{1}{\sqrt{3}} \right) \dfrac{H}{2}.
\end{array}
\]

With this choice we obtain the Gauss quadrature formula

\[
I_{Gauss}^c(f) = \frac{H}{2} \sum_{k=1}^{M} [f(\gamma_{k-1}) + f(\gamma_k)]. \tag{4.19}
\]

Fig. 4.3. Composite trapezoidal formula (left); trapezoidal formula (right)

Its order of accuracy (with respect to $H$) is equal to 4. Precisely,

\[
I(f) - I_{Gauss}^c(f) = \frac{b - a}{4320} H^4 f^{(4)}(\xi)
\]

where $\xi$ is a suitable point in $(a, b)$ (see [RR85]). Restricted to only one
interval, (4.19) becomes:

\[
I_{Gauss}(f) = \frac{b - a}{2} [f(\gamma_0) + f(\gamma_1)] \tag{4.20}
\]

where

\[
\gamma_0 = \frac{a + b}{2} - \frac{1}{\sqrt{3}} \left( \frac{b - a}{2} \right), \qquad
\gamma_1 = \frac{a + b}{2} + \frac{1}{\sqrt{3}} \left( \frac{b - a}{2} \right).
\]

The corresponding error is

\[
I(f) - I_{Gauss}(f) = \frac{(b - a)^5}{4320} f^{(4)}(\eta)
\]

for a suitable $\eta \in (a, b)$. In particular, it follows that this formula has
degree of exactness equal to 3, i.e. it computes exactly the integral of all
polynomials of degree less than or equal to 3 (but not those of degree 4).

The quadrature formulae (4.15) and (4.19) are implemented in
Program 4. The selection criterion is to set choice=1 for the trapezoidal
rule, choice=2 for the Gauss rule. The composite trapezoidal formula (4.15)
is also implemented in the MATLAB function trapz, while cumtrapz returns
the corresponding cumulative integrals.



Program 4 - trapc: composite trapezoidal and Gauss formulae

function [Itpc]=trapc(a,b,M,f,choice,varargin)
%TRAPC Composite two-points numerical integration.
% ITPC = TRAPC(A,B,M,FUN,CHOICE) computes an approximation of the
% integral of the function FUN via the trapezoidal method (using M equispaced
% intervals) if CHOICE=1, via the Gauss composite formula if CHOICE=2.
% FUN accepts a real vector input x and returns a real vector of the same size.
% FUN can also be an inline object.
H=(b-a)/M;
switch choice
case 1,
    x=linspace(a,b,M+1);
    fpm=feval(f,x,varargin{:});
    fpm(2:end-1)=2*fpm(2:end-1);
    Itpc=0.5*H*sum(fpm);
case 2,
    z=linspace(a,b,M+1);
    x=z(1:end-1)+H/2*(1-1/sqrt(3));
    x=[x, x+H/sqrt(3)];
    fpm=feval(f,x,varargin{:});
    Itpc=0.5*H*sum(fpm);
otherwise
    disp('Unknown formula');
end



Example 4.2 We want to compare the approximations of the integral
$I(f) = \int_0^{2\pi} x e^{-x} \cos(2x)\,dx = -0.122122604618968$, obtained by using the
composite midpoint, trapezoidal and Gauss formulae. In Figure 4.4 we plot on
the logarithmic scale the errors that are obtained versus $H$. As noticed in
Section 1.5, in this type of plot the greater the slope of the curve, the
higher the order of convergence of the corresponding formula. As expected from
the theoretical results, the midpoint and trapezoidal formulae are second-order
accurate, whereas the Gauss formula is fourth-order accurate.




Fig. 4.4. Logarithmic representation of the errors versus H for Gauss (solid 
line with circles) , midpoint (solid line) and trapezoidal (dashed line) composite 
quadrature formulae 
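
The orders of convergence shown in Figure 4.4 can be estimated with a few instructions. The sketch below does not call the programs of this chapter: the three composite formulae (4.12), (4.15) and (4.19) are written inline so that the fragment is self-contained, and the values of M are an arbitrary choice.

```matlab
% Observed orders of the composite midpoint, trapezoidal and Gauss formulae
% on the integral of Example 4.2. The values of M are assumptions.
f = @(x) x.*exp(-x).*cos(2*x);
a = 0; b = 2*pi; Iex = -0.122122604618968;
Ms = [10 20 40 80];
E = zeros(3, numel(Ms));
for j = 1:numel(Ms)
    M = Ms(j);  H = (b-a)/M;
    xk = linspace(a, b, M+1);                  % endpoints of the subintervals
    xm = xk(1:end-1) + H/2;                    % midpoints
    Imp = H*sum(f(xm));                        % formula (4.12)
    It  = H*(sum(f(xk)) - 0.5*(f(a) + f(b)));  % formula (4.15)
    g0  = xk(1:end-1) + H/2*(1 - 1/sqrt(3));   % left Gauss nodes
    g1  = xk(1:end-1) + H/2*(1 + 1/sqrt(3));   % right Gauss nodes
    Ig  = 0.5*H*sum(f(g0) + f(g1));            % formula (4.19)
    E(:,j) = abs([Imp; It; Ig] - Iex);
end
% H is halved at each step: log2 of the error ratio estimates the order
p = log2(E(:,1:end-1)./E(:,2:end));
disp(p)    % rows: midpoint, trapezoidal, Gauss
```

The last column of p should be close to 2 for the first two formulae and close to 4 for the Gauss formula.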

See the Exercises 4.9-4.11. 



4.2.3 Simpson formula 



The Simpson formula can be obtained by replacing the integral of $f$ over
each $I_k$ by that of its interpolating polynomial of degree 2 at the nodes
$x_{k-1}$, $\bar{x}_k = (x_{k-1}+x_k)/2$ and $x_k$,

$$\Pi_2 f(x) = \frac{2(x-\bar{x}_k)(x-x_k)}{H^2} f(x_{k-1}) + \frac{4(x_{k-1}-x)(x-x_k)}{H^2} f(\bar{x}_k) + \frac{2(x-\bar{x}_k)(x-x_{k-1})}{H^2} f(x_k).$$



The resulting formula is called the Simpson composite quadrature formula,
and reads

$$I_s^c(f) = \frac{H}{6}\sum_{k=1}^{M}\left[f(x_{k-1}) + 4f(\bar{x}_k) + f(x_k)\right]. \qquad (4.21)$$

One can prove that it induces the error

$$I(f) - I_s^c(f) = -\frac{b-a}{180}\,\frac{H^4}{16}\, f^{(4)}(\xi), \qquad (4.22)$$

where $\xi$ is a suitable point in $(a,b)$, provided that $f \in C^4([a,b])$. It is
therefore fourth-order accurate with respect to $H$. When (4.21) is applied to
only one interval, say $(a,b)$, we obtain the so-called Simpson quadrature
formula

$$I_s(f) = \frac{b-a}{6}\left[f(a) + 4f((a+b)/2) + f(b)\right]. \qquad (4.23)$$

The error now is given by

$$I(f) - I_s(f) = -\frac{1}{16}\,\frac{(b-a)^5}{180}\, f^{(4)}(\xi), \qquad (4.24)$$

for a suitable $\xi \in (a,b)$. Its degree of exactness is therefore equal to 3.
The composite Simpson rule is implemented in Program 5.



Program 5 - simpsonc: composite Simpson quadrature formula

function [Isic]=simpsonc(a,b,M,f,varargin)
%SIMPSONC Composite Simpson numerical integration.
% ISIC = SIMPSONC(A,B,M,FUN) computes an approximation of the integral
% of the function FUN via the Simpson method (using M equispaced intervals).
% FUN accepts a real vector input x and returns a real vector value.
% FUN can also be an inline object.
H=(b-a)/M;
x=linspace(a,b,M+1);               % endpoints of the subintervals
fpm=feval(f,x,varargin{:});
fpm(2:end-1)=2*fpm(2:end-1);       % interior endpoints are counted twice
Isic=H*sum(fpm)/6;
x=linspace(a+H/2,b-H/2,M);         % midpoints of the subintervals
fpm=feval(f,x,varargin{:});
Isic=Isic+2*H*sum(fpm)/3;
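As a quick check of the fourth-order accuracy stated in (4.22), the following sketch repeats the two sums of the composite Simpson rule inline for the integrand $1/x$ on $[1,2]$ (an illustrative choice, with exact integral $\log 2$) and halves $H$ a few times. The variable names are assumptions made for this example only.

```matlab
f = @(x) 1./x;  a = 1;  b = 2;  Iex = log(2);   % illustrative test problem
Mvals = [4 8 16 32];
err = zeros(size(Mvals));
for m = 1:numel(Mvals)
    M = Mvals(m);  H = (b-a)/M;
    xk = linspace(a, b, M+1);          % endpoints of the subintervals
    fk = f(xk);
    fk(2:end-1) = 2*fk(2:end-1);       % interior endpoints are counted twice
    Isic = H*sum(fk)/6;
    xm = linspace(a+H/2, b-H/2, M);    % midpoints of the subintervals
    Isic = Isic + 2*H*sum(f(xm))/3;
    err(m) = abs(Iex - Isic);
end
% Halving H should divide the error by about 2^4 = 16
ratio = err(1:end-1)./err(2:end);
disp(ratio)
```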



Remark 4.1 (Interpolatory quadratures) All quadrature formulae introduced
thus far are remarkable instances of a more general quadrature formula
of the form:

$$I_{appr}(f) = \sum_{j=1}^{J} \alpha_j f(y_j). \qquad (4.25)$$

The real numbers $\alpha_j$ are the quadrature weights, while the points $y_j$ are
the quadrature nodes. In general, one requires that (4.25) integrates exactly at
least a constant function: this property is ensured if $\sum_{j=1}^{J} \alpha_j = b - a$.

By properly choosing the weights and the nodes, the order of accuracy of
formula (4.25) can be made arbitrarily large. For instance, using the MATLAB
instruction quadl(fun,a,b) (or the obsolete one quad8), it is possible
to compute an integral with a high order Gauss-Lobatto quadrature formula.

The function fun can be an inline object. For instance, to integrate $f(x) = 1/x$
over $[1,2]$, we must first define the following function fun

fun=inline('1./x','x');

then call quadl(fun,1,2). Note that in the definition of the function $f$ we
have used an element by element operation (indeed MATLAB will evaluate
this expression component by component on the vector of quadrature nodes).

The specification of the number of subintervals is not requested as it is
automatically computed in order to ensure that the quadrature error is below
the default tolerance of $10^{-3}$. A different tolerance can be provided by the
user through the extended command quadl(fun,a,b,tol). In Section 4.3 we
will introduce a method to estimate the quadrature error and, consequently,
to change $H$ adaptively.



Let us summarize 



1. A quadrature formula is a formula to approximate the integral of 
continuous functions; 
2. it is generally expressed as a linear combination of the values of the
function at specific points (called nodes) with coefficients which are 
called weights ; 

3. the degree of exactness of a quadrature formula is the highest de- 
gree of the polynomials which are integrated exactly by the for- 
mula. The degree of exactness is 1 for midpoint and trapezoidal 
formulae, 3 for Gauss and Simpson formulae; 

4. the order of accuracy of a composite quadrature formula is its 
order with respect to the size H of the subintervals. The order of 
accuracy is 2 for composite midpoint and trapezoidal formulae, 4 
for composite Gauss and Simpson formulae. 
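The degrees of exactness listed above can be verified numerically by applying each (non-composite) formula to the monomials $x^d$ on a single interval. This is only a sketch: the interval $[0,1]$, the tolerance 1e-12 used to decide "exact", and the assumption that the Gauss formula is the two-point rule with nodes at the midpoint plus or minus $(b-a)/(2\sqrt{3})$ are choices made for this example.

```matlab
a = 0;  b = 1;  c = (a+b)/2;  h = b - a;
g = h/(2*sqrt(3));                                % offset of the two Gauss nodes
rules = { @(f) h*f(c), ...                        % midpoint
          @(f) h/2*(f(a) + f(b)), ...             % trapezoidal
          @(f) h/2*(f(c-g) + f(c+g)), ...         % two-point Gauss
          @(f) h/6*(f(a) + 4*f(c) + f(b)) };      % Simpson
names = {'midpoint', 'trapezoidal', 'Gauss', 'Simpson'};
deg = zeros(1, numel(rules));
for r = 1:numel(rules)
    d = 0;
    % increase the degree until the monomial x^d is no longer integrated exactly
    while abs(rules{r}(@(x) x.^d) - (b^(d+1) - a^(d+1))/(d+1)) < 1e-12
        d = d + 1;
    end
    deg(r) = d - 1;
    fprintf('%-12s degree of exactness %d\n', names{r}, deg(r));
end
```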



See the Exercises 4.12-4.18. 



4.3 Simpson adaptive formula 



The integration step-length $H$ of a quadrature composite formula can
be chosen in order to ensure that the quadrature error is less than a
prescribed tolerance $\varepsilon > 0$. For instance, when using the Simpson
composite formula, this goal can be achieved by virtue of (4.22), if

$$\frac{b-a}{180}\,\frac{H^4}{16}\,\max_{x\in[a,b]} |f^{(4)}(x)| < \varepsilon, \qquad (4.26)$$



where $f^{(4)}$ denotes the fourth-order derivative of $f$. Unfortunately, when
the absolute value of $f^{(4)}$ is large only in a small part of the integration
interval, the maximum $H$ for which (4.26) holds true can be too
small. The goal of the adaptive Simpson quadrature formula is to yield
an approximation of $I(f)$ within a fixed tolerance $\varepsilon$ by a nonuniform
distribution of the integration step-sizes in the interval $[a,b]$. In such a
way we retain the same accuracy of the composite Simpson rule, but
with a lower number of quadrature nodes and, consequently, a reduced
number of evaluations of $f$.
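To see how (4.26) fixes a uniform step-length in practice, the following sketch solves the inequality for $H$ and then checks the actual error of the composite Simpson formula. The integrand $1/x$ on $[1,2]$ and the bound 24 on $|f^{(4)}|$ (attained at $x = 1$) are illustrative assumptions.

```matlab
f     = @(x) 1./x;        % illustrative integrand on [1,2]
a     = 1;  b = 2;
tol   = 1e-4;             % prescribed tolerance
maxf4 = 24;               % max of |f''''(x)| = 24/x^5 on [1,2]

% (4.26): (b-a)/180 * H^4/16 * maxf4 < tol  =>  H < (2880*tol/((b-a)*maxf4))^(1/4)
Hmax = (2880*tol/((b-a)*maxf4))^(1/4);
M = ceil((b-a)/Hmax);     % smallest number of equal subintervals with H <= Hmax
H = (b-a)/M;

% composite Simpson formula (4.21) with M subintervals
xk = a + (0:M)*H;  xm = xk(1:end-1) + H/2;
fk = f(xk);
Isc = H/6*(fk(1) + 2*sum(fk(2:end-1)) + fk(end) + 4*sum(f(xm)));
err = abs(log(2) - Isc);
fprintf('M = %d, H = %.4f, error = %.2e\n', M, H, err);
```

Since (4.26) uses the maximum of $|f^{(4)}|$ over the whole interval, the observed error is typically well below the tolerance.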

To this end, we must find an error estimator and an automatic proce- 
dure to modify the integration step-length H , according to the achieve- 
ment of the prescribed tolerance. We start by analyzing this procedure, 
which is independent of the specific quadrature formula that one wants 
to apply. 

In the first step of the adaptive procedure, we compute an approximation
$I_s(f)$ of $I(f) = \int_a^b f(x)\,dx$. We set $H = b - a$ and we try to estimate
the quadrature error. If the error is less than the prescribed tolerance,
the adaptive procedure is stopped; otherwise the step-size $H$ is halved
until the integral $\int_a^{a+H} f(x)\,dx$ is computed with the prescribed
accuracy. When the test is passed, we consider the interval $(a+H, b)$ and we
repeat the previous procedure, choosing as the first step-size the length
$b - (a+H)$ of that interval.

We use the following notations: 

1. A: the active integration interval, i.e., the interval where the inte- 
gral is being computed; 

2. S: the integration interval already examined, for which the error is 
less than the prescribed tolerance; 

3. N: the integration interval yet to be examined. 

At the beginning of the integration process we have $N = [a,b]$, $A = N$
and $S = \emptyset$, while the situation at the generic step of the algorithm is
depicted in Figure 4.5. Let $J_S(f)$ indicate the computed approximation
of $\int_a^{\alpha} f(x)\,dx$, with $J_S(f) = 0$ at the beginning of the process; if the
algorithm successfully terminates, $J_S(f)$ yields the desired approximation of
$I(f)$. We also denote by $J_{(\alpha,\beta)}(f)$ the approximate integral of $f$ over the
active interval $[\alpha,\beta]$. This interval is drawn in gray in Figure 4.5. The
generic step of the adaptive integration method is organized as follows:

1. If the estimation of the error ensures that the prescribed tolerance
is satisfied, then:

(i) $J_S(f)$ is increased by $J_{(\alpha,\beta)}(f)$, that is $J_S(f) \leftarrow J_S(f) + J_{(\alpha,\beta)}(f)$;

(ii) we let $S \leftarrow S \cup A$, $A = N$ (corresponding to the path (I) in Figure
4.5) and $\alpha \leftarrow \beta$ and $\beta \leftarrow b$;

2. If the estimation of the error fails the prescribed tolerance, then:

(j) $A$ is halved, and the new active interval is set to $A = [\alpha, \alpha']$ with
$\alpha' = (\alpha + \beta)/2$ (corresponding to the path (II) in Figure 4.5);

(jj) we let $N \leftarrow N \cup [\alpha', \beta]$, $\beta \leftarrow \alpha'$;

(jjj) a new error estimate is provided.

Of course, in order to prevent the algorithm from generating too small
step-sizes, it is convenient to monitor the width of $A$ and warn the user,
in case of an excessive reduction of the step-length, about the presence
of a possible singularity in the integrand function.

Fig. 4.5. Distribution of the integration intervals at the generic step of the
adaptive algorithm and updating of the integration grid

The problem now is to find a suitable estimator of the error. To this
end, it is convenient to restrict our attention to a generic subinterval
$[\alpha,\beta]$ in which we compute $I_s(f)$: of course, if on this interval the error
is less than $\varepsilon(\beta-\alpha)/(b-a)$, then the error on the interval $[a,b]$ will be
less than the prescribed tolerance $\varepsilon$. Since from (4.24) we get

$$\int_\alpha^\beta f(x)\,dx - I_s(f) = -\frac{(\beta-\alpha)^5}{2880}\, f^{(4)}(\xi) = E_s(f;\alpha,\beta),$$

to ensure the achievement of the tolerance, it will be sufficient to verify
that $E_s(f;\alpha,\beta) < \varepsilon(\beta-\alpha)/(b-a)$. In practical computation, this
procedure is not feasible since the point $\xi \in [\alpha,\beta]$ is unknown.

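To estimate the error without using explicitly the value $f^{(4)}(\xi)$, we
employ again the composite Simpson formula to compute $\int_\alpha^\beta f(x)\,dx$,
but with a step-length $(\beta-\alpha)/2$. From (4.22) with $a = \alpha$ and $b = \beta$, we
deduce that

$$\int_\alpha^\beta f(x)\,dx - I_s^c(f) = -\frac{(\beta-\alpha)^5}{46080}\, f^{(4)}(\eta), \qquad (4.27)$$

where $\eta$ is a suitable point different from $\xi$. Subtracting the last two
equations, we get

$$\Delta I = I_s^c(f) - I_s(f) = -\frac{(\beta-\alpha)^5}{2880}\, f^{(4)}(\xi) + \frac{(\beta-\alpha)^5}{46080}\, f^{(4)}(\eta). \qquad (4.28)$$

Let us now make the assumption that $f^{(4)}(x)$ is approximately a constant
on the interval $[\alpha,\beta]$. In this case $f^{(4)}(\xi) \simeq f^{(4)}(\eta)$. We can compute
$f^{(4)}(\eta)$ from (4.28) and, putting this value in the equation (4.27),
we obtain the following estimation of the error:

$$\int_\alpha^\beta f(x)\,dx - I_s^c(f) \simeq \frac{1}{15}\,\Delta I.$$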
The step-length $(\beta-\alpha)/2$ (that is the step-length employed to compute
$I_s^c(f)$) will be accepted if $|\Delta I|/15 < \varepsilon(\beta-\alpha)/[2(b-a)]$. The quadrature
formula that uses this criterion in the adaptive procedure described
previously is called adaptive Simpson formula. It is implemented in Program
6. Among the input parameters, f is the string in which the function $f$ is
defined, a and b are the endpoints of the integration interval, tol is the
prescribed tolerance on the error and hmin is the minimum admissible
value of the integration step-length (in order to ensure that the adaptation
procedure always terminates).
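The quality of the estimator $\Delta I/15$ can be checked on a case where the exact integral is known. The sketch below uses $f(x) = e^x$ on $[\alpha,\beta] = [0, 0.5]$ (an illustrative choice for which $f^{(4)}$ varies mildly); the variable names al, be, est and true_err are assumptions of this example.

```matlab
f  = @(x) exp(x);  al = 0;  be = 0.5;     % illustrative integrand and interval
Iex = exp(be) - exp(al);                  % exact integral
L   = be - al;  mid = (al + be)/2;

% Simpson formula (4.23) on [al,be]
Is  = L/6*(f(al) + 4*f(mid) + f(be));
% composite Simpson formula with step-length (be-al)/2
Isc = L/12*(f(al) + 4*f(al+L/4) + 2*f(mid) + 4*f(be-L/4) + f(be));

DeltaI   = Isc - Is;
est      = DeltaI/15;                     % estimated error of Isc
true_err = Iex - Isc;                     % actual error of Isc
fprintf('estimated error %.3e, true error %.3e\n', est, true_err);
```

The two printed values should agree in sign and, to within a few percent, in size, which is what makes the test on $|\Delta I|/15$ usable as a stopping criterion.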



Program 6 - simpadpt: adaptive Simpson formula

function [JSf,nodes]=simpadpt(f,a,b,tol,hmin)
%SIMPADPT Numerically evaluate integral, adaptive Simpson quadrature.
% JSF = SIMPADPT(FUN,A,B,TOL,HMIN) tries to approximate the
% integral of function FUN from A to B to within an error of TOL using
% recursive adaptive Simpson quadrature. The inline function Y = FUN(V)
% should accept a vector argument V and return a vector result Y, the
% integrand evaluated at each element of V.
%
% [JSF,NODES] = SIMPADPT(...) returns the distribution of nodes.
A=[a,b]; N=[]; S=[]; JSf=0; ba=b-a; nodes=[];
while ~isempty(A),
  [deltaI,ISc]=caldeltai(A,f);
  if abs(deltaI) <= 15*tol*(A(2)-A(1))/ba;
    % test passed: accept the contribution of the active interval
    JSf = JSf + ISc;
    S = union(S,A);
    nodes = [nodes, A(1) (A(1)+A(2))*0.5 A(2)];
    S = [S(1), S(end)]; A = N; N = [];
  elseif A(2)-A(1) < hmin
    % active interval too small: accept anyway and warn the user
    JSf = JSf + ISc;
    S = union(S,A);
    S = [S(1), S(end)]; A = N; N = [];
    warning('Too small step-length');
  else
    % test failed: halve the active interval
    Am = (A(1)+A(2))*0.5;
    A = [A(1) Am];
    N = [Am, b];
  end
end
nodes=unique(nodes);
return

function [deltaI,ISc]=caldeltai(A,f)
L=A(2)-A(1);
t=[0; 0.25; 0.5; 0.5; 0.75; 1];
x=L*t+A(1);
L=L/6;
w=[1; 4; 1];
fx=feval(f,x);
IS=L*sum(fx([1 3 6]).*w);          % Simpson formula on A
ISc=0.5*L*sum(fx.*[w;w]);          % composite Simpson formula on the two halves of A
deltaI=IS-ISc;
return



Example 4.3 Let us compute the integral $I(f) = \int_{-1}^{1} e^{-10(x-1)^2}\,dx$ by the
adaptive Simpson formula. Running Program 6 with tol = $10^{-4}$ and hmin =
$10^{-3}$ provides 0.28024765884708, instead of the exact value 0.28024956081990.
The error is less than $10^{-5}$, hence well within the prescribed tolerance. To
obtain this result it was sufficient to use only 10 nonuniform subintervals.
Note that the corresponding composite formula with uniform step-size would
have required 22 subintervals to ensure the same accuracy.
The midpoint, trapezoidal and Simpson formulae are particular cases of
a larger family of quadrature rules known as Newton-Cotes formulae.
For an introduction, see [QSS00], Chapter 10. Analogously, the Gauss 
formula (4.20) is a special case of the important family of Gaussian 
quadrature formulae. These are optimal in the sense that they maxi- 
mize the degree of exactness for a given number of quadrature nodes. In 
MATLAB 6 the function quadl implements one of these formulae. For 
an introduction to the Gaussian formulae, see [QSS00], Chapter 10 or 
[RR85]. Further developments on numerical integration can be found, 
e.g., in [DR75] and [PdDKUK83]. 

Numerical integration can also be used to compute integrals on unbounded
intervals. For instance, to approximate $\int_0^\infty f(x)\,dx$, a first
possibility is to find a point $\alpha$ such that the value of $\int_\alpha^\infty f(x)\,dx$
can be neglected with respect to that of $\int_0^\alpha f(x)\,dx$. Then we compute by a
quadrature formula this latter integral on a bounded interval. A second
possibility is to resort to Gaussian quadrature formulae for unbounded
intervals (see [QSS00], Chapter 10).
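The first possibility (truncation of the interval) can be sketched as follows for $f(x) = e^{-x^2}$, whose integral over $(0,\infty)$ is $\sqrt{\pi}/2$. The truncation point alpha = 6 and the number M = 200 of subintervals are illustrative assumptions: beyond $x = 6$ the integrand is below $e^{-36}$, so the neglected tail is far smaller than the quadrature error.

```matlab
f = @(x) exp(-x.^2);      % illustrative integrand, decays very fast
alpha = 6;                % truncation point (assumption)
M = 200;  H = alpha/M;    % uniform subdivision of [0,alpha]

% composite Simpson formula (4.21) on [0,alpha]
xk = (0:M)*H;  xm = xk(1:end-1) + H/2;
fk = f(xk);
I = H/6*(fk(1) + 2*sum(fk(2:end-1)) + fk(end) + 4*sum(f(xm)));

err = abs(I - sqrt(pi)/2);
fprintf('approximation %.12f, error %.2e\n', I, err);
```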

Finally, numerical integration can also be used to compute multidimensional
integrals. In particular, we mention the MATLAB instruction
dblquad('f',xmin,xmax,ymin,ymax) by which it is possible to compute
the integral of a function contained in the MATLAB file f.m over
the rectangular domain [xmin,xmax] x [ymin,ymax]. Note that the
function f must have at least two input parameters corresponding to the
variables x and y with respect to which the integral is computed.



4.5 Exercises 



Exercise 4.1 Verify that, if $f \in C^3$ in a neighborhood $I_0$ of $x_0$ (respectively,
$I_n$ of $x_n$) the error of the formula (4.10) is equal to $-\frac{1}{3} f'''(\xi_0)h^2$ (respectively,
$-\frac{1}{3} f'''(\xi_n)h^2$), where $\xi_0$ and $\xi_n$ are two suitable points belonging to $I_0$ and $I_n$,
respectively.

Exercise 4.2 Verify that if $f \in C^3$ in a neighborhood of $x$ the error of the
formula (4.8) is equal to (4.9).

Exercise 4.3 Compute the order of accuracy with respect to $h$ of the following
formulae for the numerical approximation of $f'(x_i)$:

a. $\dfrac{-11 f(x_i) + 18 f(x_{i+1}) - 9 f(x_{i+2}) + 2 f(x_{i+3})}{6h}$,

b. $\dfrac{f(x_{i-2}) - 6 f(x_{i-1}) + 3 f(x_i) + 2 f(x_{i+1})}{6h}$,

c. $\dfrac{f(x_{i-2}) - 8 f(x_{i-1}) + 8 f(x_{i+1}) - f(x_{i+2})}{12h}$.

Exercise 4.4 The following values represent the time evolution of the number
$n(t)$ of individuals of a given population whose birth rate is constant ($b = 2$)
and mortality rate is $d(t) = 0.01 n(t)$:

t (months)   0     0.5   1     1.5   2     2.5   3
n            100   147   178   192   197   199   200

Use these data to approximate as accurately as possible the rate of variation
of this population. Then compare the obtained results with the exact rate
$n'(t) = 2n(t) - 0.01 n^2(t)$.



Exercise 4.5 Find the minimum number $M$ of subintervals to approximate
with an absolute error less than $10^{-4}$ the integrals of the following functions:



h(x) 



1 



1 + (x — 7 r) 2 

/sW = 



over (0,5), f 2 (x) = e x cos(x) over (0, 7r), 

1 



y/xiy-x) 



over (0, 1), 



using the midpoint composite formula. Verify the results obtained using the 
Program 3. 



Exercise 4.6 Prove (4.13) starting from (4.14). 
Exercise 4.7 Why does the midpoint formula lose one order of convergence
when used in its composite mode? 

Exercise 4.8 Verify that, if $f$ is a polynomial of degree less than or equal to 1,
then $I_{mp}(f) = I(f)$, i.e. the midpoint formula has degree of exactness equal to 1.

Exercise 4.9 For the function $f_1$ of Exercise 4.5, compute (numerically) the
values of $M$ which ensure that the quadrature error is less than $10^{-4}$ when the
integral is approximated by the composite trapezoidal and Gauss quadrature
formulae.

Exercise 4.10 Let $I_1$ and $I_2$ be two values obtained by the composite
trapezoidal formula applied with two different step-lengths, $H_1$ and $H_2$, for the
approximation of $I(f) = \int_a^b f(x)\,dx$. Verify that, if $f^{(2)}$ has a mild variation
on $(a,b)$, the value

$$I_R = I_1 + (I_1 - I_2)/(H_2^2/H_1^2 - 1) \qquad (4.29)$$

is an approximation of $I(f)$ better than $I_1$ and $I_2$. This strategy is called the
Richardson extrapolation method.

Exercise 4.11 Verify that, among all formulae of the form $I_{approx}(f) = \alpha f(x) +
\beta f(z)$ where $x, z \in [a,b]$ are two unknown nodes and $\alpha$ and $\beta$ two
undetermined weights, formula (4.20) features the maximum degree of exactness.

Exercise 4.12 For the first two functions of Exercise 4.5, compute the
minimum number of intervals such that the quadrature error of the composite
Simpson quadrature formula is less than $10^{-4}$.

Exercise 4.13 Compute $\int_0^2 e^{-x^2/2}\,dx$ by the Simpson formula (4.23) and
Gauss formula (4.20), then compare the obtained results.

Exercise 4.14 To compute the integrals $I_k = \int_0^1 x^k e^{x-1}\,dx$ for $k = 1, 2, \ldots$,
one can use the following recursive formula: $I_k = 1 - k I_{k-1}$, with $I_1 = 1/e$.
Compute $I_{20}$ by the composite Simpson formula in order to ensure that the
quadrature error is less than $10^{-3}$. Compare the Simpson approximation with
the result obtained by the above recursive formula.

Exercise 4.15 Apply the Richardson extrapolation formula (4.29) for the
approximation of the integral $I(f) = \int_0^2 e^{-x^2/2}\,dx$, with $H_1 = 1$ and $H_2 = 0.5$
using first Simpson formula (4.23), then the Gauss formula (4.20). Verify that
$I_R$ is more accurate than $I_1$ and $I_2$.

Exercise 4.16 Compute by the composite Simpson formula the function $j(r)$
defined in (4.2) for $r = k/10$ m with $k = 1, \ldots, 10$, with $\rho(\xi) = e^{\xi}$ and $\sigma = 0.36$
W/(mK). Ensure that the quadrature error is less than $10^{-10}$. (Recall that:
m=meters, W=watts, K=degrees Kelvin.)

Exercise 4.17 By using the composite Simpson and Gauss formulae compute
the function $E(T)$, defined in (4.1), for $T$ equal to 213 K, up to at least 10
exact significant digits.

Exercise 4.18 Develop a strategy to compute $I(f) = \int_0^1 |x^2 - 0.25|\,dx$ by the
composite Simpson formula such that the quadrature error is less than $10^{-2}$.




5 Linear systems

In applied sciences, one is quite often led to face a linear system of the
form

$$A\mathbf{x} = \mathbf{b} \qquad (5.1)$$

for the solution of complex problems, where $A$ is a square matrix of
dimension $n \times n$ whose elements are either real or complex, while
$\mathbf{x}$ and $\mathbf{b}$ are column vectors of dimension $n$ with $\mathbf{x}$ representing the
unknown solution and $\mathbf{b}$ a given vector. Component-wise, (5.1) can be
written as

$$\begin{array}{c}
a_{11}x_1 + a_{12}x_2 + \ldots + a_{1n}x_n = b_1, \\
a_{21}x_1 + a_{22}x_2 + \ldots + a_{2n}x_n = b_2, \\
\vdots \\
a_{n1}x_1 + a_{n2}x_2 + \ldots + a_{nn}x_n = b_n.
\end{array}$$

We present three different problems that give rise to linear systems. 

Problem 5.1 (Hydraulics) Let us consider the hydraulic network made of
the 10 pipelines in Figure 5.1, which is fed by a reservoir of water at constant
pressure $p_r = 10$ bar. In this problem, pressure values refer to the difference
between the real pressure and the atmospheric one. For the $j$-th pipeline,
the following relationship holds between the flow-rate $Q_j$ (in m$^3$/s) and the
pressure gap $\Delta p_j$ at pipe-ends:

$$Q_j = k L \Delta p_j, \qquad (5.2)$$

where $k$ is the hydraulic resistance (in m$^2$/(bar s)) and $L$ is the length (in m)
of the pipeline. We assume that water flows from the outlets (indicated by a
black dot) at atmospheric pressure, which is set to 0 bar for coherence with
the previous convention.

A typical problem consists of determining the pressure values at every
internal node 1, 2, 3, 4. With this aim, for each $j = 1, 2, 3, 4$ we can supplement the
relationship (5.2) with the statement that the algebraic sum of the flow-rates
of the pipelines which meet at the $j$-th node must be null (a negative value would
indicate the presence of a seepage).

Denoting by $\mathbf{p} = (p_1, p_2, p_3, p_4)^T$ the pressure vector at the internal nodes,
we get a $4 \times 4$ system of the form $A\mathbf{p} = \mathbf{b}$.

Let us summarize in the following table the relevant characteristics of the
different pipelines.

pipeline   k       L    |  pipeline   k       L    |  pipeline   k       L
1          0.01    20   |  2          0.005   10   |  3          0.005   14
4          0.005   10   |  5          0.005   10   |  6          0.002   8
7          0.002   8    |  8          0.002   8    |  9          0.005   10
10         0.002   8    |

Correspondingly, $A$ and $\mathbf{b}$ take the following values (only the first 4 significant
digits are provided):

$$A = \begin{bmatrix}
-0.370 & 0.050 & 0.050 & 0.070 \\
0.050 & -0.116 & 0 & 0.050 \\
0.050 & 0 & -0.116 & 0.050 \\
0.070 & 0.050 & 0.050 & -0.202
\end{bmatrix}, \qquad
\mathbf{b} = \begin{bmatrix} -2 \\ 0 \\ 0 \\ 0 \end{bmatrix}.$$

The solution of this system is postponed to Example 5.4. •




Problem 5.2 (Spectrometry) Let us consider a gas mixture of $n$ non-reactive
unknown components. Using a mass spectrometer the compound is bombarded
by low-energy electrons: the resulting mixture of ions is analyzed by a
galvanometer which shows peaks corresponding to specific mass/charge ratios.
We consider only the $n$ most relevant peaks. One may conjecture that the
height $h_i$ of the $i$-th peak is a linear combination of $\{p_j, j = 1, \ldots, n\}$, $p_j$
being the partial pressure of the $j$-th component (that is the pressure exerted
by a single gas when it is part of a mixture), yielding

$$\sum_{j=1}^{n} s_{ij} p_j = h_i, \qquad i = 1, \ldots, n,$$

where the $s_{ij}$ are the so-called sensitivity coefficients. The determination of
the partial pressures demands therefore the solution of a linear system. •

Problem 5.3 (Economy: input-output analysis) We want to determine
the situation of equilibrium between demand and supply of certain goods. In
particular, let us consider $n$ factories that produce $n$ different products. They
must face the internal demand of goods necessary to the factories for their
own production, as well as the external demand from the consumers. Let $x_i$,
$i = 1, \ldots, n$, denote the number of units of the $i$-th product produced by the
$i$-th factory and let $b_i$, $i = 1, \ldots, n$, denote the number of units of the $i$-th
product absorbed by the market. Finally, let $c_{ij}$ be the fraction of $x_i$ necessary
to the $j$-th factory for the production of the $j$-th product (see Figure 5.2).
According to the Leontief model, we assume that the transformation functions
that relate the various products are linear. Then the equilibrium is reached when the
vector $\mathbf{x}$ of the total production equals the total demand, that is, $\mathbf{x} = C\mathbf{x} + \mathbf{b}$,
where $C = (c_{ij})$ and $\mathbf{b} = (b_i)$. Thus, in this model the total production satisfies
the linear system

$$A\mathbf{x} = \mathbf{b}, \quad \text{where } A = I - C. \qquad (5.3)$$

For its solution, see Exercise 5.17.



Fig. 5.2. The interaction scheme of 3 factories with the market



The solution of the system (5.1) exists iff $A$ is non-singular. In principle,
the solution might be computed by the so-called Cramer rule:

$$x_i = \frac{\det(A_i)}{\det(A)}, \qquad i = 1, \ldots, n,$$

where $A_i$ is the matrix obtained from $A$ by replacing the $i$-th column by
$\mathbf{b}$ and $\det(A)$ denotes the determinant of $A$. If the $n+1$ determinants are
computed by the Laplace expansion (see Exercise 5.1), a total number
of $2(n+1)!$ flops is required. As usual, by operation we mean a sum, a
subtraction, a product or a division. For instance, a computer capable of
carrying out $10^9$ flops per second (i.e. 1 gigaflops) would require about
12 hours to solve a system of dimension $n = 15$, 3240 years if $n = 20$ and
$10^{143}$ years if $n = 100$. The computational cost can be drastically reduced
to the order of about $n^{3.8}$ flops if the $n+1$ determinants are computed
by the algorithm quoted in Example 1.3. Yet, this cost is still too high
for the large values of $n$ which often arise in practical applications.

We need therefore to look for alternative approaches. Let us note that
in general a system cannot be solved by fewer than $n^2$ flops. Indeed, if the
equations are fully coupled, we should expect that every one of the $n^2$
matrix coefficients would be involved in an operation at least once.
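The running times quoted above follow directly from the count $2(n+1)!$ and from the assumed rate of $10^9$ operations per second. A minimal sketch of the arithmetic (a year is taken as 365 days, an assumption of this example):

```matlab
rate = 1e9;                        % operations per second, as assumed in the text
n    = [15 20 100];
ops  = 2*factorial(n+1);           % cost of the Cramer rule with Laplace expansion
secs = ops/rate;
hours = secs/3600;
years = secs/(3600*24*365);
fprintf('n = 15 : about %.1f hours\n', hours(1));
fprintf('n = 20 : about %.0f years\n', years(2));
fprintf('n = 100: about 10^%d years\n', floor(log10(years(3))));
```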



5.1 The LU factorization method 



Let A be a square matrix of order n. Assume that there exist two suitable 
matrices L and U, lower triangular and upper triangular, respectively, 
such that 



A = LU. (5.4) 

We call (5.4) an LU-factorization (or decomposition) of $A$. Should $A$ be
non-singular, so are both $L$ and $U$, and thus their diagonal elements are
non-null (as observed in Section 1.3).

In such a case, solving Ax = b leads to the solution of two triangular 
systems 



Ly = b, Ux = y. 



(5.5) 



Both systems are easy to solve. Indeed, $L$ being lower triangular, the
first row of the system $L\mathbf{y} = \mathbf{b}$ takes the form:

$$l_{11} y_1 = b_1,$$

which provides the value of $y_1$ since $l_{11} \neq 0$. By substituting this value
of $y_1$ in the subsequent $n-1$ equations we obtain a new system whose
unknowns are $y_2, \ldots, y_n$, on which we can proceed in a similar manner.
Proceeding forward, equation by equation, we can compute all unknowns
with the following forward substitutions algorithm:

$$y_1 = \frac{1}{l_{11}}\, b_1, \qquad
y_i = \frac{1}{l_{ii}} \Big( b_i - \sum_{j=1}^{i-1} l_{ij} y_j \Big), \quad i = 2, \ldots, n. \qquad (5.6)$$

Let us quantify the number of operations required from (5.6). Since
$i-1$ sums, $i-1$ products and 1 division are needed to compute the
unknown $y_i$, the total number of operations required is

$$\sum_{i=1}^{n} 1 + 2\sum_{i=1}^{n} (i-1) = 2\sum_{i=1}^{n} i - n = n^2.$$

The system $U\mathbf{x} = \mathbf{y}$ can be solved by proceeding in a similar manner.
This time, the first unknown to be computed is $x_n$ then, by proceeding
backward, we can compute the remaining unknowns $x_i$, for $i = n-1$ to
$i = 1$:

$$x_n = \frac{1}{u_{nn}}\, y_n, \qquad
x_i = \frac{1}{u_{ii}} \Big( y_i - \sum_{j=i+1}^{n} u_{ij} x_j \Big), \quad i = n-1, \ldots, 1. \qquad (5.7)$$

This is called backward substitutions algorithm and requires $n^2$ flops too.
At this stage we need an algorithm that allows an effective computation
of the factors $L$ and $U$ of the matrix $A$. We illustrate a general procedure
starting from a couple of examples.
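The forward and backward substitutions can be sketched in a few lines. The triangular factors and the right hand side below are illustrative values chosen for this example (they are built so that the solution of $LU\mathbf{x} = \mathbf{b}$ is the vector of ones); they are not produced by any program of this chapter.

```matlab
L = [1 0 0; 2 1 0; 3 3 1];        % illustrative lower triangular factor
U = [1 0 3; 0 2 -4; 0 0 7];       % illustrative upper triangular factor
b = [4; 6; 13];                   % equals (L*U)*[1;1;1]
n = length(b);

% forward substitutions: solve L*y = b from the first row downwards
y = zeros(n,1);
for i = 1:n
    y(i) = (b(i) - L(i,1:i-1)*y(1:i-1)) / L(i,i);
end

% backward substitutions: solve U*x = y from the last row upwards
x = zeros(n,1);
for i = n:-1:1
    x(i) = (y(i) - U(i,i+1:n)*x(i+1:n)) / U(i,i);
end
disp(x')
```

For i = 1 (respectively i = n) the index ranges 1:i-1 (respectively i+1:n) are empty, and MATLAB evaluates the corresponding product as 0, so no special case is needed.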

Example 5.1 Let us write the relation (5.4) for a generic matrix $A \in \mathbb{R}^{2\times 2}$

$$\begin{bmatrix} l_{11} & 0 \\ l_{21} & l_{22} \end{bmatrix}
\begin{bmatrix} u_{11} & u_{12} \\ 0 & u_{22} \end{bmatrix} =
\begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}.$$

The 6 unknown elements of $L$ and $U$ must satisfy the following (non-linear)
equations:

$$\begin{array}{ll}
(e_1)\ l_{11}u_{11} = a_{11}, & (e_2)\ l_{11}u_{12} = a_{12}, \\
(e_3)\ l_{21}u_{11} = a_{21}, & (e_4)\ l_{21}u_{12} + l_{22}u_{22} = a_{22}.
\end{array} \qquad (5.8)$$

System (5.8) is underdetermined as it features fewer equations than unknowns.
We can complete it by assigning arbitrarily the diagonal elements of $L$, for
instance setting $l_{11} = 1$ and $l_{22} = 1$. Now system (5.8) can be solved by
proceeding as follows: we determine the elements $u_{11}$ and $u_{12}$ of the first row
of $U$ using $(e_1)$ and $(e_2)$. If $u_{11}$ is nonnull then from $(e_3)$ we deduce $l_{21}$ (that is
the first column of $L$, since $l_{11}$ is already available). Now we can obtain from
$(e_4)$ the only non zero element $u_{22}$ of the second row of $U$.

Example 5.2 Let us repeat the same computations in the case of a $3 \times 3$
matrix. For the 12 unknown coefficients of $L$ and $U$ we have the following 9
equations:

$$\begin{array}{lll}
(e_1)\ l_{11}u_{11} = a_{11}, & (e_2)\ l_{11}u_{12} = a_{12}, & (e_3)\ l_{11}u_{13} = a_{13}, \\
(e_4)\ l_{21}u_{11} = a_{21}, & (e_5)\ l_{21}u_{12} + l_{22}u_{22} = a_{22}, & (e_6)\ l_{21}u_{13} + l_{22}u_{23} = a_{23}, \\
(e_7)\ l_{31}u_{11} = a_{31}, & (e_8)\ l_{31}u_{12} + l_{32}u_{22} = a_{32}, & (e_9)\ l_{31}u_{13} + l_{32}u_{23} + l_{33}u_{33} = a_{33}.
\end{array}$$

Let us complete this system by setting $l_{ii} = 1$ for $i = 1, 2, 3$. Now, the
coefficients of the first row of $U$ can be obtained by using $(e_1)$, $(e_2)$ and $(e_3)$.
Next, using $(e_4)$ and $(e_7)$, we can determine the coefficients $l_{21}$ and $l_{31}$ of the
first column of $L$. Using $(e_5)$ and $(e_6)$ we can now compute the coefficients $u_{22}$
and $u_{23}$ of the second row of $U$. Then, using $(e_8)$, we obtain the coefficient $l_{32}$
of the second column of $L$. Finally, the last row of $U$ (which consists of the
only element $u_{33}$) can be determined by solving $(e_9)$.

On a matrix of arbitrary dimension $n$ we can proceed as follows:

1. The elements of $L$ and $U$ satisfy the system of non-linear equations

$$\sum_{r=1}^{\min(i,j)} l_{ir} u_{rj} = a_{ij}, \qquad i, j = 1, \ldots, n; \qquad (5.9)$$

2. System (5.9) is underdetermined; indeed there are $n^2$ equations
and $n^2 + n$ unknowns, thus the factorization LU cannot be unique;

3. By forcing the $n$ diagonal elements of $L$ to be equal to 1, (5.9) turns
into a determined system which can be solved by the following
Gauss algorithm: set $A^{(1)} = A$ i.e. $a_{ij}^{(1)} = a_{ij}$ for $i, j = 1, \ldots, n$; for
$k = 1, \ldots, n-1$ do

$$\begin{array}{l}
\text{for } i = k+1, \ldots, n \\
\quad l_{ik} = a_{ik}^{(k)} / a_{kk}^{(k)}, \\
\quad \text{for } j = k+1, \ldots, n \\
\qquad a_{ij}^{(k+1)} = a_{ij}^{(k)} - l_{ik}\, a_{kj}^{(k)}.
\end{array} \qquad (5.10)$$

The elements $a_{kk}^{(k)}$ must all be different from zero and are called pivot
elements. For every $k = 1, \ldots, n-1$ the matrix $A^{(k+1)} = (a_{ij}^{(k+1)})$ has
$n-k$ rows and columns.

At the end of this procedure the elements of the upper triangular
matrix $U$ are given by $u_{ij} = a_{ij}^{(i)}$ for $i = 1, \ldots, n$ and $j = i, \ldots, n$,
whereas those of $L$ are given by the coefficients $l_{ij}$ generated by this
algorithm. In (5.10) there is no computation of the diagonal elements of
$L$, as we already know that their value is equal to 1.

This factorization is called the Gauss factorization; the determination
of the elements of the factors $L$ and $U$ requires about $2n^3/3$ flops (see
Exercise 5.4).



Example 5.3 Consider the following Vandermonde matrix

$$A = (a_{ij}) \ \text{ with } \ a_{ij} = x_i^{n-j}, \qquad i, j = 1, \ldots, n, \qquad (5.11)$$

where the $x_i$ are $n$ distinct abscissae. It can be constructed using the MATLAB
command vander. In Figure 5.3 we report the number of floating point
operations required in order to compute the Gauss factorization of $A$. Several
values of $n$ are considered in abscissae, and the corresponding number of
flops are indicated with circles. They have been obtained using the following
commands:

>> ops = [ ]; for n = [10:10:100]
A = vander(linspace(1,2,n));
flops(0), A = lu_gauss(A);
ops = [ops, flops];
end

The curve reported in the picture is a polynomial in $n$ of the third order
representing the least square approximation of the above data, and is generated
by the following commands:

>> c3=polyfit([10:10:100],ops,3)

0.6667 -0.0000 0.3333 -0.0000

The coefficient of the monomial $n^3$ is exactly 2/3.
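The command flops is not available in recent MATLAB releases, so the experiment can be imitated by counting the operations of the algorithm (5.10) by hand. The sketch below is an assumption-laden substitute, not the computation of the example: it counts one division plus $n-k$ products and $n-k$ subtractions for each of the $n-k$ rows updated at step $k$, so the lower-order coefficients of the fitted cubic need not coincide with those printed above.

```matlab
nn  = 10:10:100;
ops = zeros(size(nn));
for s = 1:numel(nn)
    n = nn(s);  c = 0;
    for k = 1:n-1
        % each of the n-k rows below the pivot costs 1 division
        % plus (n-k) products and (n-k) subtractions
        c = c + (n-k)*(1 + 2*(n-k));
    end
    ops(s) = c;
end
c3 = polyfit(nn, ops, 3);          % least-squares cubic in n
fprintf('leading coefficient = %.4f\n', c3(1));
```

The leading coefficient of the fit should again be close to 2/3.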

Fig. 5.3. The number of flops as a function of $n$, necessary to generate the
Gauss factorization LU of the Vandermonde matrix. This function is a cubic
polynomial obtained by approximating in the least square sense the values
(represented by circles) corresponding to $n = 10, 20, \ldots, 100$

It is not necessary to store the matrices $A^{(k)}$; actually we can overlap
the $(n-k) \times (n-k)$ elements of $A^{(k+1)}$ on the corresponding last
$(n-k) \times (n-k)$ elements of the original matrix $A$. Moreover, since at the $k$-th
step, the subdiagonal elements of the $k$-th column don't have any effect
on the final $U$, they can be replaced by the entries of the $k$-th column
of $L$, as done in Program 7. Then, at the $k$-th step of the process the
elements stored at location of the original entries of $A$ are

$$\begin{bmatrix}
a_{11}^{(1)} & a_{12}^{(1)} & \ldots & \ldots & \ldots & a_{1n}^{(1)} \\
l_{21} & a_{22}^{(2)} & & & & a_{2n}^{(2)} \\
\vdots & \ddots & \ddots & & & \vdots \\
l_{k1} & \ldots & l_{k,k-1} & a_{kk}^{(k)} & \ldots & a_{kn}^{(k)} \\
\vdots & & \vdots & \vdots & & \vdots \\
l_{n1} & \ldots & l_{n,k-1} & a_{nk}^{(k)} & \ldots & a_{nn}^{(k)}
\end{bmatrix}$$

where the lower right submatrix of entries $a_{ij}^{(k)}$, $i, j = k, \ldots, n$, is $A^{(k)}$.
The Gauss factorization is the basis of several MATLAB commands:

[L,U]=lu(A) whose mode of use will be discussed in Section 5.2;

inv that allows the computation of the inverse of a matrix;

\ by which it is possible to solve a linear system with matrix A and
right hand side b by simply writing A\b.



Remark 5.1 (Computing a determinant) By means of the LU factorization
one can compute the determinant of $A$ with a computational cost of $O(n^3)$
operations, noting that (see Sect. 1.3)

$$\det(A) = \det(L)\det(U) = \prod_{k=1}^{n} u_{kk}.$$

As a matter of fact, this procedure is also at the basis of the MATLAB
command det.

In Program 7 we implement the algorithm (5.10). The factor $L$ is stored in
the (strictly) lower triangular part of A and $U$ in the upper triangular part
of A (for the sake of storage saving). After the program execution, the two
factors can be recovered by simply writing: L = eye(n) + tril(A,-1)
and U = triu(A), where n is the size of A.



Program 7 - lu_gauss: Gauss factorization

function A=lu_gauss(A)
%LU_GAUSS LU factorization without pivoting.
% A = LU_GAUSS(A) stores an upper triangular matrix in the upper triangular
% part of A and a lower triangular matrix in the strictly lower part of A
% (the diagonal elements of L are 1).
[n,m]=size(A);
if n ~= m; error('A is not a square matrix'); else
  for k = 1:n-1
    % check the pivot before dividing by it
    if A(k,k) == 0, error('Null diagonal element'); end
    for i = k+1:n
      A(i,k) = A(i,k)/A(k,k);
      for j = k+1:n
        A(i,j) = A(i,j) - A(i,k)*A(k,j);
      end
    end
  end
end

Example 5.4 Let us compute the solution of the system encountered in
Problem 5.1 by using the LU factorization, then applying the forward and
backward substitution algorithms. We need to compute the matrix A and the
right hand side b and execute the following instructions:

>> A=lu_gauss(A);
>> y(1)=b(1); for i = 2:4; y = [y; b(i)-A(i,1:i-1)*y(1:i-1)]; end
>> x(4)=y(4)/A(4,4);
>> for i = 3:-1:1; x(i)= (y(i)-A(i,i+1:4)*x(i+1:4)')/A(i,i); end

The result is $\mathbf{p} = (8.1172, 5.9893, 5.9893, 5.7779)^T$.

Example 5.5 Suppose that we solve Ax = b with

\[
A = \begin{bmatrix} 1 & 1-\varepsilon & 3 \\ 2 & 2 & 2 \\ 3 & 6 & 4 \end{bmatrix},
\qquad
\mathbf{b} = \begin{bmatrix} 5-\varepsilon \\ 6 \\ 13 \end{bmatrix},
\qquad \varepsilon \in \mathbb{R}, \tag{5.12}
\]

whose solution is $\mathbf{x} = (1,1,1)^T$ (independently of the value of
$\varepsilon$).

Let us set $\varepsilon = 1$. The Gauss factorization of A obtained by
Program 7 yields

\[
L = \begin{bmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 3 & 3 & 1 \end{bmatrix},
\qquad
U = \begin{bmatrix} 1 & 0 & 3 \\ 0 & 2 & -4 \\ 0 & 0 & 7 \end{bmatrix}.
\]

If we set $\varepsilon = 0$, despite the fact that A is non singular, the
Gauss factorization cannot be carried out since the algorithm (5.10)
would involve divisions by 0.
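The computations of Example 5.5 can be retraced with a few instructions. The sketch below keeps L and U in separate arrays (rather than overwriting A as Program 7 does) purely for readability; the variable names are illustrative assumptions.

```matlab
% Example 5.5 with epsilon = 1: factorization without pivoting, then two
% triangular solves (illustrative sketch).
epsilon = 1;
A = [1 1-epsilon 3; 2 2 2; 3 6 4];
b = [5-epsilon; 6; 13];
n = 3; L = eye(n); U = A;
for k = 1:n-1
    for i = k+1:n
        L(i,k) = U(i,k)/U(k,k);            % multiplier l_ik
        U(i,:) = U(i,:) - L(i,k)*U(k,:);   % update row i
    end
end
y = L\b;      % forward substitution  L y = b
x = U\y;      % backward substitution U x = y

% With epsilon = 0 the matrix is non singular, but its second principal
% submatrix is singular, hence the null pivot at the second step.
A0 = [1 1 3; 2 2 2; 3 6 4];
r2 = rank(A0(1:2,1:2));   % rank 1: the 2x2 principal submatrix is singular
d0 = det(A0);             % non-zero: A0 itself is non singular
```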

The previous example shows that, unfortunately, the Gauss factoriza- 
tion A=LU does not necessarily exist for every non singular matrix A. 
In this respect, the following result can be proven: 

Proposition 5.1 For a given matrix $A \in \mathbb{R}^{n\times n}$, its Gauss
factorization exists if and only if the principal submatrices $A_i$ of A of
order $i = 1,\dots,n-1$ (that is those obtained by restricting A to its first i
rows and columns) are non singular. This factorization is unique if A is
non singular.

Going back to Example 5.5, we can notice that when $\varepsilon = 0$ the second
principal submatrix $A_2$ of the matrix A is singular.

We can identify special classes of matrices for which the hypotheses 
of Proposition 5.1 are fulfilled. In particular, we mention: 

1. Symmetric and positive definite matrices. A matrix $A \in \mathbb{R}^{n\times n}$ is
positive definite if

\[
\forall \mathbf{x} \in \mathbb{R}^n \text{ with } \mathbf{x} \neq \mathbf{0}, \qquad \mathbf{x}^T A \mathbf{x} > 0;
\]

2. Diagonally dominant matrices. A matrix is diagonally dominant by
row if

\[
|a_{ii}| \geq \sum_{j=1,\, j\neq i}^{n} |a_{ij}|, \qquad i = 1,\dots,n,
\]

by column if

\[
|a_{ii}| \geq \sum_{j=1,\, j\neq i}^{n} |a_{ji}|, \qquad i = 1,\dots,n.
\]

A special case occurs when in the previous inequalities we can
replace $\geq$ by $>$. Then the matrix A is called strictly diagonally
dominant (by row or by column, respectively).

If A is symmetric and positive definite, it is moreover possible to
construct a special factorization:

\[
A = HH^T, \tag{5.13}
\]

where H is a lower triangular matrix with positive diagonal elements.
This is the so-called Cholesky factorization and requires about $n^3/3$
operations (half of those required by the Gauss LU factorization). Further,
let us note that, due to the symmetry, only the lower part of A is stored,
and H can be stored in the same area.

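The elements of H can be computed by the following algorithm: for
$i = 1,\dots,n$,

\[
h_{ij} = \frac{1}{h_{jj}} \left( a_{ij} - \sum_{k=1}^{j-1} h_{ik} h_{jk} \right), \quad j = 1,\dots,i-1,
\qquad
h_{ii} = \sqrt{a_{ii} - \sum_{k=1}^{i-1} h_{ik}^2}
\]

(a sum over an empty set of indices is understood to be zero).

Cholesky factorization is available in MATLAB through the command chol.
Since chol(A) returns the upper triangular factor $H^T$, one sets H=chol(A)'.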
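The row-by-row algorithm for H can be sketched as follows. The test matrix is an illustrative assumption (it is symmetric and positive definite), and the result is compared with the transpose of the factor returned by chol.

```matlab
% Cholesky factorization A = H*H' computed row by row (illustrative sketch).
A = [4 2 2; 2 5 3; 2 3 6];    % hypothetical symmetric positive definite matrix
n = size(A,1);
H = zeros(n);
for i = 1:n
    for j = 1:i-1
        % off-diagonal entry h_ij; the product is 0 when j = 1 (empty sum)
        H(i,j) = (A(i,j) - H(i,1:j-1)*H(j,1:j-1)')/H(j,j);
    end
    % diagonal entry h_ii
    H(i,i) = sqrt(A(i,i) - H(i,1:i-1)*H(i,1:i-1)');
end
Hc = chol(A)';                % chol returns the upper triangular factor H'
```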



See Exercises 5.1-5.5.



5.2 The technique of pivoting 




We are going to introduce a special technique that allows us to achieve 
the LU factorization for every non singular matrix, even if the hypotheses 
of Proposition 5.1 are not fulfilled. 

Let us go back to the case described in Example 5.5 and take $\varepsilon = 0$.
Setting $A^{(1)} = A$, after having carried out the first step ($k = 1$) of the
procedure, the new entries of A are

\[
\begin{bmatrix} 1 & 1 & 3 \\ 2 & 0 & -4 \\ 3 & 3 & -5 \end{bmatrix} \tag{5.14}
\]

(the first column holds the multipliers $l_{21} = 2$ and $l_{31} = 3$, stored
as in Program 7). Since the pivot $a_{22}$ is equal to zero, this procedure
cannot be continued further. On the other hand, should we interchange the
second and third rows beforehand, we would obtain the matrix

\[
\begin{bmatrix} 1 & 1 & 3 \\ 3 & 3 & -5 \\ 2 & 0 & -4 \end{bmatrix}
\]

and thus the factorization could be accomplished without involving a
division by 0.

We can state that permutation in a suitable manner of the rows of the 
original matrix A would make the entire factorization procedure feasible 
even if the hypotheses of Proposition 5.1 are not verified. Unfortunately, 
we cannot know a priori which rows should be permuted. However, this 
decision can be made at every step k at which a null diagonal element 
is generated. 

Let us return to the matrix $A^{(2)}$ in (5.14): since $a_{22}$ is null, let us
interchange the third and second row of $A^{(2)}$ and check whether the
new $a_{22}$ that is generated is still null. By executing the second step of
the factorization procedure we find the same matrix that we would have
generated by an a priori permutation of the same two rows of A.

We can therefore perform a row permutation as soon as this becomes 
necessary, without carrying out any a priori transformation on A. Since 
a row permutation entails changing the pivot element, this technique is 
given the name of pivoting by row. The factorization that is generated in 
this way returns the original matrix up to a row permutation. Precisely 
we obtain 



PA = LU. 



P is a suitable permutation matrix. At the beginning it is set equal to 
the identity matrix, then if in the course of the procedure the rows r 
and s of A are permuted, the same permutation must be performed on 
the homologous rows of P. Correspondingly, we should now solve the 
following triangular systems 

Ly = Pb, Ux = y. (5.15) 

From the second equation of (5.10) we see that not only null pivot
elements are troublesome, but also those which are very small. Indeed,
should $a_{kk}^{(k)}$ be near zero, possible roundoff errors affecting the
coefficients $a_{kj}^{(k)}$ will be severely amplified.



Example 5.6 Consider the nonsingular matrix

\[
A = \begin{bmatrix} 1 & 1 + 0.5 \cdot 10^{-15} & 3 \\ 2 & 2 & 20 \\ 3 & 6 & 4 \end{bmatrix}.
\]

During the factorization procedure by Program 7 no null pivot elements are
obtained. Yet, the factors L and U turn out to be quite inaccurate, as one can
realize by computing the residual matrix A - LU (which should be the null
matrix if all operations were carried out in exact arithmetic):

\[
A - LU = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 6 \end{bmatrix}.
\]



It is therefore recommended to carry out the pivoting at every step
of the factorization procedure, by searching among all virtual pivot
elements $a_{ik}^{(k)}$ with $i = k,\dots,n$, the one with maximum modulus. The
algorithm (5.10) with pivoting by row carried out at every step takes the
following form: for $k = 1,\dots,n$, do

\[
\begin{array}{l}
\text{find } \bar{r} \text{ such that } |a_{\bar{r}k}^{(k)}| = \max\limits_{r=k,\dots,n} |a_{rk}^{(k)}|, \\
\text{exchange row } k \text{ with row } \bar{r}, \\
\text{for } i = k+1,\dots,n \\
\quad l_{ik} = a_{ik}^{(k)} / a_{kk}^{(k)}, \\
\quad \text{for } j = k+1,\dots,n \\
\qquad a_{ij}^{(k+1)} = a_{ij}^{(k)} - l_{ik}\, a_{kj}^{(k)}
\end{array}
\tag{5.16}
\]



The MATLAB program lu that we have mentioned previously computes
the Gauss factorization with pivoting by row. Its complete syntax
is indeed [L,U,P]=lu(A), P being the permutation matrix. When called
in the shorthand mode [L,U]=lu(A), the matrix L is equal to P'*M, where
M is lower triangular and P is the permutation matrix generated by the
pivoting by row. The program lu carries out the pivoting by row
automatically, at every step of the factorization.
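A possible transcription of the factorization with pivoting by row is sketched below on the matrix of Example 5.5 with epsilon = 0. The factors are kept in separate arrays and the rows of P and of the already computed part of L are exchanged together with those of U; all variable names are illustrative assumptions.

```matlab
% Gauss factorization with pivoting by row, P*A = L*U (illustrative sketch).
A = [1 1 3; 2 2 2; 3 6 4];    % Example 5.5 with epsilon = 0
b = [5; 6; 13];
n = size(A,1);
P = eye(n); L = eye(n); U = A;
for k = 1:n-1
    [pmax, r] = max(abs(U(k:n,k)));   % pivot of maximum modulus in column k
    r = r + k - 1;                    % row index referred to the whole matrix
    U([k r],:) = U([r k],:);          % exchange rows k and r of U
    P([k r],:) = P([r k],:);          % same exchange on P
    L([k r],1:k-1) = L([r k],1:k-1);  % and on the multipliers already stored
    for i = k+1:n
        L(i,k) = U(i,k)/U(k,k);
        U(i,k:n) = U(i,k:n) - L(i,k)*U(k,k:n);
    end
end
y = L\(P*b);                  % L y = P b
x = U\y;                      % U x = y
```

The same factors (up to roundoff) are expected from [L,U,P]=lu(A).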

See Exercises 5.6-5.8.



5.3 How accurate is the LU factorization? 



We have already noticed in Example 5.6 that, due to roundoff errors,
the product LU does not reproduce A exactly. Even though the pivoting
strategy damps these errors, the result can sometimes be rather
unsatisfactory.

Example 5.7 Consider the linear system $A_n \mathbf{x}_n = \mathbf{b}_n$ where
$A_n \in \mathbb{R}^{n\times n}$ is the so-called Hilbert matrix whose elements are

\[
a_{ij} = 1/(i+j-1), \qquad i,j = 1,\dots,n,
\]

while $\mathbf{b}_n$ is chosen in such a way that the exact solution is
$\mathbf{x}_n = (1,1,\dots,1)^T$. The matrix $A_n$ is clearly symmetric and one can
prove that it is also positive definite.

For different values of n we use the MATLAB function lu to get the Gauss
factorization of $A_n$ with pivoting by row. Then we solve the associated linear
systems (5.15) and denote by $\hat{\mathbf{x}}_n$ the computed solution. In Figure 5.4 we
report (in logarithmic scale) the relative errors

\[
E_n = \|\mathbf{x}_n - \hat{\mathbf{x}}_n\| / \|\mathbf{x}_n\|, \tag{5.17}
\]

having denoted by $\|\cdot\|$ the Euclidean norm introduced in Section 1.3.1.
We have $E_n \geq 10$ if $n \geq 13$ (that is a relative error on the solution higher
than 1000%!), whereas $R_n = L_n U_n - P_n A_n$ is the null matrix (up to machine
accuracy) for whatever value of n.

Fig. 5.4. Behavior versus n of $E_n$ (solid line) and of $\max_{i,j=1,\dots,n} |r_{ij}|$
(dashed line) in logarithmic scale, for the Hilbert system of Example 5.7. The
$r_{ij}$ are the coefficients of the matrix $R_n$
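The experiment of Example 5.7 can be repeated along the following lines for a single value of n. This is only a sketch: the value n = 13 and the variable names are illustrative choices, and the error actually observed depends on the MATLAB version and on the machine, so no specific value is claimed here.

```matlab
% Hilbert system of Example 5.7 for one value of n (illustrative sketch).
n = 13;
A = hilb(n);                  % Hilbert matrix, a_ij = 1/(i+j-1)
xex = ones(n,1);              % exact solution
b = A*xex;
[L,U,P] = lu(A);              % Gauss factorization with pivoting by row
y = L\(P*b);                  % systems (5.15)
xh = U\y;
E = norm(xex - xh)/norm(xex); % relative error (5.17)
R = L*U - P*A;                % residual matrix of the factorization
maxR = max(abs(R(:)));
```

MATLAB may warn that a matrix is close to singular when these systems are solved. The quantity maxR is expected to be of the order of the machine accuracy, while E is many orders of magnitude larger.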

On the grounds of the previous remark, we could speculate by saying
that, when a linear system Ax = b is solved numerically, one is indeed
looking for the exact solution $\hat{\mathbf{x}}$ of a perturbed system

\[
(A + \delta A)\hat{\mathbf{x}} = \mathbf{b} + \delta\mathbf{b}, \tag{5.18}
\]

where $\delta A$ and $\delta\mathbf{b}$ are respectively a matrix and a vector which
depend on the specific numerical method which is being used. We start by
considering the case where $\delta A = 0$ and $\delta\mathbf{b} \neq \mathbf{0}$ which is simpler
than the most general case. Moreover, for simplicity we will also assume
that A is symmetric and positive definite.

By comparing (5.1) and (5.18) we find $\mathbf{x} - \hat{\mathbf{x}} = -A^{-1}\delta\mathbf{b}$, and thus

\[
\|\mathbf{x} - \hat{\mathbf{x}}\| = \|A^{-1}\delta\mathbf{b}\|. \tag{5.19}
\]

In order to find an upper bound for the right hand side of (5.19), we
proceed as follows. Since A is symmetric and positive definite, the set
of its eigenvectors $\{\mathbf{v}_i\}_{i=1}^{n}$ furnishes an orthonormal basis of
$\mathbb{R}^n$ (see [QSS00], chapter 5). This means that

\[
A\mathbf{v}_i = \lambda_i \mathbf{v}_i, \quad i = 1,\dots,n,
\qquad
\mathbf{v}_i^T \mathbf{v}_j = \delta_{ij}, \quad i,j = 1,\dots,n,
\]

where $\lambda_i$ is the eigenvalue of A associated with $\mathbf{v}_i$ and $\delta_{ij}$ is
the Kronecker symbol. Consequently, a generic vector $\mathbf{w} \in \mathbb{R}^n$ can
be written as

\[
\mathbf{w} = \sum_{i=1}^{n} w_i \mathbf{v}_i
\]

for a suitable (and unique) set of coefficients $w_i \in \mathbb{R}$. We have

\[
\begin{aligned}
\|A\mathbf{w}\|^2 &= (A\mathbf{w})^T (A\mathbf{w}) \\
&= [w_1 (A\mathbf{v}_1)^T + \dots + w_n (A\mathbf{v}_n)^T][w_1 A\mathbf{v}_1 + \dots + w_n A\mathbf{v}_n] \\
&= (\lambda_1 w_1 \mathbf{v}_1^T + \dots + \lambda_n w_n \mathbf{v}_n^T)(\lambda_1 w_1 \mathbf{v}_1 + \dots + \lambda_n w_n \mathbf{v}_n) \\
&= \sum_{i=1}^{n} \lambda_i^2 w_i^2.
\end{aligned}
\]

Denote by $\lambda_{max}$ the largest eigenvalue of A. Since
$\|\mathbf{w}\|^2 = \sum_{i=1}^{n} w_i^2$, we conclude that

\[
\|A\mathbf{w}\| \leq \lambda_{max} \|\mathbf{w}\| \qquad \forall \mathbf{w} \in \mathbb{R}^n. \tag{5.20}
\]

In a similar manner, we obtain

\[
\|A^{-1}\mathbf{w}\| \leq \frac{1}{\lambda_{min}} \|\mathbf{w}\|,
\]

upon recalling that the eigenvalues of $A^{-1}$ are the reciprocals of those
of A. This inequality enables us to draw from (5.19) that

\[
\frac{\|\mathbf{x} - \hat{\mathbf{x}}\|}{\|\mathbf{x}\|} \leq \frac{1}{\lambda_{min}} \frac{\|\delta\mathbf{b}\|}{\|\mathbf{x}\|}. \tag{5.21}
\]

Using (5.20) once more and recalling that Ax = b, so that
$\|\mathbf{b}\| \leq \lambda_{max}\|\mathbf{x}\|$, we finally obtain

\[
\frac{\|\mathbf{x} - \hat{\mathbf{x}}\|}{\|\mathbf{x}\|} \leq \frac{\lambda_{max}}{\lambda_{min}} \frac{\|\delta\mathbf{b}\|}{\|\mathbf{b}\|}. \tag{5.22}
\]

We can conclude that the relative error in the solution depends on the
relative error in the data through the following constant ($\geq 1$)

\[
K(A) = \frac{\lambda_{max}}{\lambda_{min}} \tag{5.23}
\]

which is called spectral condition number of the matrix A. K(A) can be
computed in MATLAB using the command cond(A). Other definitions
for the condition number are available for non symmetric matrices, see
[QSS00], Chapter 3.
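The estimate (5.22) can be checked numerically on a small example. In the sketch below the symmetric positive definite matrix, the perturbation db and the variable names are illustrative assumptions.

```matlab
% Spectral condition number (5.23) and the bound (5.22) (illustrative sketch).
A = [4 1 0; 1 3 1; 0 1 2];      % hypothetical symmetric positive definite matrix
lambda = eig(A);
K = max(lambda)/min(lambda);    % K(A) = lambda_max/lambda_min
x = ones(3,1);  b = A*x;        % exact solution and right hand side
db = 1e-6*[1; -1; 1];           % hypothetical perturbation of the data
xh = A\(b + db);                % solution of the perturbed system (delta A = 0)
lhs = norm(x - xh)/norm(x);     % relative error on the solution
rhs = K*norm(db)/norm(b);       % right hand side of (5.22)
```

For a symmetric positive definite matrix, K is expected to agree with cond(A) up to roundoff, and lhs should not exceed rhs.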

Remark 5.2 The MATLAB command cond(A) allows the computation of
the condition number of any type of matrix A, even those which are not
symmetric and positive definite. A special command condest(A) is available to
compute an approximation of the condition number of A, and one rcond(A)
for its reciprocal, with a substantial saving of floating point operations. If the
matrix A is ill-conditioned (i.e. $K(A) \gg 1$), the computation of its condition
number can be very inaccurate. Consider for instance the tridiagonal matrices
$A_n = \mathrm{tridiag}(-1, 2, -1)$ for different values of n. $A_n$ is symmetric and
positive definite, its eigenvalues are $\lambda_j = 2 - 2\cos(j\theta)$, for $j = 1,\dots,n$,
with $\theta = \pi/(n+1)$, hence $K(A_n)$ can be computed exactly. In Figure 5.5 we
report the value of the error $E_K(n) = |K(A_n) - \mathrm{cond}(A_n)|/K(A_n)$. Note that
$E_K(n)$ increases when n increases.



Fig. 5.5. Behavior of $E_K(n)$ as a function of n (in logarithmic scale)
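For a single value of n, the error $E_K(n)$ of Remark 5.2 can be evaluated as in the sketch below. The choice n = 50 and the variable names are illustrative assumptions; the matrix is built in full (non sparse) format so that cond can be applied to it directly.

```matlab
% Error E_K(n) of Remark 5.2 for one value of n (illustrative sketch).
n = 50;
A = 2*eye(n) - diag(ones(n-1,1),1) - diag(ones(n-1,1),-1);  % tridiag(-1,2,-1)
theta = pi/(n+1);
lambda = 2 - 2*cos((1:n)*theta);   % eigenvalues of A_n in closed form
Kex = max(lambda)/min(lambda);     % K(A_n) from the closed-form eigenvalues
EK = abs(Kex - cond(A))/Kex;       % relative error of cond
```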

A more involved proof would lead to the following more general result
in the case where $\delta A$ is an arbitrary symmetric and positive definite
matrix "small enough" to satisfy $\lambda_{max}(\delta A) < \lambda_{min}(A)$:

\[
\frac{\|\mathbf{x} - \hat{\mathbf{x}}\|}{\|\mathbf{x}\|} \leq
\frac{K(A)}{1 - \lambda_{max}(\delta A)/\lambda_{min}}
\left( \frac{\lambda_{max}(\delta A)}{\lambda_{max}} + \frac{\|\delta\mathbf{b}\|}{\|\mathbf{b}\|} \right).
\]

If K(A) is "small", that is of the order of the unity, A is said to be
well conditioned. In that case, small errors in the data will lead to errors
of the same order of magnitude in the solution. This might not be the
case for ill conditioned matrices.



Example 5.8 For the Hilbert matrix introduced in Example 5.7, $K(A_n)$ is a
rapidly increasing function of n. One has $K(A_4) > 15000$, while if $n > 13$ the
condition number is so high that MATLAB warns that the matrix is "close to
singular". Actually, $K(A_n)$ grows at an exponential rate: $K(A_n) \simeq e^{3.5n}$
(see [Hig96]). This provides an indirect explanation of the bad results
obtained in Example 5.7.

Inequality (5.22) can be reformulated with the help of the residual r:

\[
\mathbf{r} = \mathbf{b} - A\hat{\mathbf{x}}. \tag{5.24}
\]

Should $\hat{\mathbf{x}}$ be the exact solution, the residual would be the null vector.
Thus, in general, r can be regarded as an estimator of the error
$\mathbf{x} - \hat{\mathbf{x}}$. The extent to which the residual is a good error estimator
depends on the size of the condition number of A. Indeed, observing that
$\delta\mathbf{b} = A(\hat{\mathbf{x}} - \mathbf{x}) = A\hat{\mathbf{x}} - \mathbf{b} = -\mathbf{r}$, we deduce from (5.22) that

\[
\frac{\|\mathbf{x} - \hat{\mathbf{x}}\|}{\|\mathbf{x}\|} \leq K(A) \frac{\|\mathbf{r}\|}{\|\mathbf{b}\|}. \tag{5.25}
\]

Thus if K(A) is "small", we can be sure that the error is small provided
that the residual is small, whereas this might not be true when K(A) is
"large".



Example 5.9 The residuals associated with the computed solution of the
linear systems of Example 5.7 are very small (their norms vary between
$10^{-16}$ and $10^{-11}$); however the computed solutions differ remarkably
from the exact solution.




See Exercises 5.9-5.10.



5.4 How to solve a tridiagonal system



In many applications (see for instance Chapter 8), we have to solve a
system whose matrix has the form

\[
A = \begin{bmatrix}
a_1 & c_1 & & 0 \\
e_2 & a_2 & \ddots & \\
 & \ddots & \ddots & c_{n-1} \\
0 & & e_n & a_n
\end{bmatrix}.
\]

This matrix is called tridiagonal since the only elements that can be 
non-null belong to the main diagonal and to the first super and sub 
diagonals. 

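Assume that the matrix coefficients are real numbers. Then if the
Gauss factorization LU of A exists, the factors L and U must be
bidiagonal (lower and upper, respectively), more precisely:

\[
L = \begin{bmatrix}
1 & & & 0 \\
\beta_2 & 1 & & \\
 & \ddots & \ddots & \\
0 & & \beta_n & 1
\end{bmatrix},
\qquad
U = \begin{bmatrix}
\alpha_1 & c_1 & & 0 \\
 & \alpha_2 & \ddots & \\
 & & \ddots & c_{n-1} \\
0 & & & \alpha_n
\end{bmatrix}.
\]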

The unknown coefficients $\alpha_i$ and $\beta_i$ can be determined by requiring that
the equality LU = A holds. This yields the following recursive relations
for the computation of the L and U factors:

\[
\alpha_1 = a_1, \qquad \beta_i = \frac{e_i}{\alpha_{i-1}}, \qquad \alpha_i = a_i - \beta_i c_{i-1}, \qquad i = 2,\dots,n. \tag{5.26}
\]

Using (5.26), we can easily solve the two bidiagonal systems Ly = b and
Ux = y, to obtain the following formulae:

\[
(L\mathbf{y} = \mathbf{b}) \qquad y_1 = b_1, \qquad y_i = b_i - \beta_i y_{i-1}, \qquad i = 2,\dots,n, \tag{5.27}
\]

\[
(U\mathbf{x} = \mathbf{y}) \qquad x_n = \frac{y_n}{\alpha_n}, \qquad x_i = (y_i - c_i x_{i+1})/\alpha_i, \qquad i = n-1,\dots,1. \tag{5.28}
\]

This is known as the Thomas algorithm and allows the solution of the
original system with a computational cost of the order of n operations.
The MATLAB command spdiags allows the construction of a tridiagonal
matrix. For instance, the commands

>> b=ones(10,1); a=2*b; c=3*b;
>> T=spdiags([b a c],-1:1,10,10);

compute the tridiagonal matrix $T \in \mathbb{R}^{10\times 10}$ with elements equal to 2 on
the main diagonal, 1 on the first subdiagonal and 3 on the first
superdiagonal.

Note that T is stored in a sparse mode, according to which the only
elements stored are those different from 0. When a system is solved
by invoking the command \, MATLAB is able to recognize the type of 
matrix (in particular, whether it has been generated in a sparse mode) 
and select the most appropriate solution algorithm. In particular, when A 
is a tridiagonal matrix generated in sparse mode, the Thomas algorithm 
is the selected algorithm. 
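The recursions (5.26)-(5.28) translate almost literally into MATLAB. In the sketch below the three diagonals are stored in vectors of length n (the entries e(1) and c(n) are not used); the test data and the variable names are illustrative assumptions, and no check on null values of $\alpha_i$ is made.

```matlab
% Thomas algorithm (5.26)-(5.28) for a tridiagonal system (illustrative sketch).
n = 6;
a = 4*ones(n,1);     % main diagonal a_1,...,a_n
c = -ones(n,1);      % superdiagonal c_1,...,c_{n-1} (c(n) unused)
e = -2*ones(n,1);    % subdiagonal e_2,...,e_n (e(1) unused)
A = diag(a) + diag(c(1:n-1),1) + diag(e(2:n),-1);  % full matrix, only to build b
b = A*ones(n,1);     % right hand side such that x = (1,...,1)'

alpha = zeros(n,1); beta = zeros(n,1); y = zeros(n,1); x = zeros(n,1);
alpha(1) = a(1);  y(1) = b(1);
for i = 2:n
    beta(i)  = e(i)/alpha(i-1);          % (5.26)
    alpha(i) = a(i) - beta(i)*c(i-1);    % (5.26)
    y(i)     = b(i) - beta(i)*y(i-1);    % (5.27)
end
x(n) = y(n)/alpha(n);                    % (5.28)
for i = n-1:-1:1
    x(i) = (y(i) - c(i)*x(i+1))/alpha(i);
end
```

Each loop visits every index once, which is consistent with a cost proportional to n.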

Example 5.10 Let us solve the linear system Tx = b where T is a tridiagonal
matrix constructed as previously with a variable size n, and b is chosen such
that the exact solution is $\mathbf{x} = (1,\dots,1)^T$. With the following instructions we
solve this system (by invoking the \ command) for increasing values of n and
compute the number of flops divided by n. This ratio for n very large tends to
a constant, proving that the number of operations scales linearly with n, as
expected.

>> ratio = [ ];
>> for k = 50:25:200
b=ones(k,1); a=2*b; c=3*b;
T=spdiags([b a c],-1:1,k,k);
rhs = T*ones(k,1);
flops(0); x=T\rhs; ratio = [ratio, flops/k];
end
>> ratio =
Columns 1 through 7
24.5000 24.6133 24.7300 24.7520 24.8067 24.7771 24.8150



Let us summarize 



1. The factorization LU of A consists of computing a lower triangular 
matrix L and an upper triangular matrix U such that A = LU; 

2. the factorization LU, provided it exists, is not unique. However,
it can be determined unequivocally by furnishing an additional
condition such as, e.g., setting the diagonal elements of L equal
to 1. This is called Gauss factorization;

3. the Gauss factorization exists if and only if the principal
submatrices of A of order 1, 2, ..., n-1 are non singular (otherwise
at least one pivot element is null); further, it is unique provided
$\det(A) \neq 0$;

4. if a null pivot element is generated, a new pivot element can be 
obtained by exchanging in a suitable manner two rows (or columns) 
of our system. This is the pivoting strategy; 

5. the computation of the Gauss factorization requires about $2n^3/3$
operations, and only an order of n operations in the case of
tridiagonal systems;

6. for symmetric and positive definite matrices we can use the Cholesky
factorization $A = HH^T$, where H is a lower triangular matrix, and
the computational cost is of the order of $n^3/3$ operations;

7. the sensitivity of the result to perturbation of data depends on 
the condition number of the system matrix; more precisely, the 
accuracy of the computed solution can be low for ill conditioned 
matrices. 



5.5 Iterative methods 



An iterative method for the solution of the linear system (5.1) consists
of setting up a sequence of vectors $\mathbf{x}^{(k)} \in \mathbb{R}^n$ that converge to the
exact solution x, that is

\[
\lim_{k\to\infty} \mathbf{x}^{(k)} = \mathbf{x}, \tag{5.29}
\]

for any given initial vector $\mathbf{x}^{(0)} \in \mathbb{R}^n$. A possible strategy able to
realize this process can be based on the following recursive definition

\[
\mathbf{x}^{(k+1)} = B\mathbf{x}^{(k)} + \mathbf{g}, \qquad k \geq 0, \tag{5.30}
\]

where B is a suitable matrix (depending on A) and g is a suitable vector
(depending on A and b), which must satisfy the relation

\[
\mathbf{x} = B\mathbf{x} + \mathbf{g}. \tag{5.31}
\]

Since $\mathbf{x} = A^{-1}\mathbf{b}$ this yields $\mathbf{g} = (I - B)A^{-1}\mathbf{b}$.

Let $\mathbf{e}^{(k)} = \mathbf{x} - \mathbf{x}^{(k)}$ define the error at the k-th step. By subtracting
(5.30) from (5.31), we obtain

\[
\mathbf{e}^{(k+1)} = B\mathbf{e}^{(k)}.
\]

For this reason B is called the iteration matrix associated with (5.30). If
B is symmetric and positive definite, by (5.20) we have

\[
\|\mathbf{e}^{(k+1)}\| = \|B\mathbf{e}^{(k)}\| \leq \rho(B)\|\mathbf{e}^{(k)}\|, \qquad \forall k \geq 0.
\]

We have denoted by $\rho(B)$ the spectral radius of B, that is, the maximum
modulus of eigenvalues of B. By iterating the same inequality
backward, we obtain

\[
\|\mathbf{e}^{(k)}\| \leq [\rho(B)]^k \|\mathbf{e}^{(0)}\|, \qquad k \geq 0. \tag{5.32}
\]

Thus $\mathbf{e}^{(k)} \to \mathbf{0}$ as $k \to \infty$ for every possible $\mathbf{e}^{(0)}$ (and henceforth
$\mathbf{x}^{(0)}$) provided that $\rho(B) < 1$. Actually, this property is also necessary
for convergence.

Should, by any chance, an approximate value of $\rho(B)$ be available,
(5.32) would allow us to deduce the minimum number of iterations $k_{min}$
that are needed to damp the initial error by a factor $\varepsilon$. Indeed, $k_{min}$
would be the lowest positive integer for which $[\rho(B)]^{k_{min}} \leq \varepsilon$.
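Taking logarithms in $[\rho(B)]^{k_{min}} \leq \varepsilon$ gives $k_{min} \geq \log\varepsilon / \log\rho(B)$ (the inequality is reversed because $\log\rho(B) < 0$). A minimal sketch, in which the values of rho and epsilon are illustrative assumptions:

```matlab
% Minimum number of iterations deduced from (5.32) (illustrative sketch).
rho = 0.9;            % hypothetical (approximate) spectral radius, 0 < rho < 1
epsilon = 1e-6;       % hypothetical reduction factor, 0 < epsilon < 1
kmin = ceil(log(epsilon)/log(rho));   % lowest integer with rho^kmin <= epsilon
```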

In conclusion, the following result holds: 

Proposition 5.2 For an iterative method of the form (5.30) whose
iteration matrix satisfies (5.31), convergence for any $\mathbf{x}^{(0)}$ holds iff
$\rho(B) < 1$. Moreover, the smaller $\rho(B)$, the fewer the number of iterations
necessary to reduce the initial error by a given factor.



5.5.1 How to construct an iterative method 



A general technique to devise an iterative method is based on a splitting
of the matrix A, A = P - (P - A), P being a suitable nonsingular matrix
(called the preconditioner of A). Then

\[
\mathbf{x} = P^{-1}(P - A)\mathbf{x} + P^{-1}\mathbf{b},
\]

which has the form (5.31) provided that we set $B = P^{-1}(P - A) = I - P^{-1}A$
and $\mathbf{g} = P^{-1}\mathbf{b}$. Correspondingly, we can define the following
iterative method:

\[
P(\mathbf{x}^{(k+1)} - \mathbf{x}^{(k)}) = \mathbf{r}^{(k)}, \qquad k \geq 0,
\]

where $\mathbf{r}^{(k)} = \mathbf{b} - A\mathbf{x}^{(k)}$ denotes the residual vector at the k-th iteration.
A generalization of this iterative method is the following

\[
P(\mathbf{x}^{(k+1)} - \mathbf{x}^{(k)}) = \alpha_k \mathbf{r}^{(k)}, \qquad k \geq 0, \tag{5.33}
\]

where $\alpha_k \neq 0$ is a parameter that may change at every iteration k.

The method (5.33) requires at each step the solution of the linear
system

\[
P\mathbf{z}^{(k)} = \mathbf{r}^{(k)}, \tag{5.34}
\]

then the new iterate is defined by $\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} + \alpha_k \mathbf{z}^{(k)}$. For that reason
the matrix P ought to be chosen in such a way that the computational
cost for the solution of (5.34) be quite low (e.g., every P either diagonal
or triangular or tridiagonal will serve the purpose). Let us now consider
some special instances of iterative methods which take the form (5.33).

The Jacobi method 

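If the diagonal entries of A are nonzero, we can set P = D, where D
is the diagonal matrix of the diagonal entries of A. The Jacobi method
corresponds to this choice with the assumption $\alpha_k = 1$ for all k. Then
from (5.33) we obtain

\[
D\mathbf{x}^{(k+1)} = \mathbf{b} - (A - D)\mathbf{x}^{(k)}, \qquad k \geq 0,
\]

or, componentwise,

\[
x_i^{(k+1)} = \frac{1}{a_{ii}} \left( b_i - \sum_{j=1,\, j\neq i}^{n} a_{ij} x_j^{(k)} \right), \qquad i = 1,\dots,n, \tag{5.35}
\]

where $k \geq 0$ and $\mathbf{x}^{(0)} = (x_1^{(0)}, x_2^{(0)}, \dots, x_n^{(0)})^T$ is the initial vector.

The iteration matrix is therefore

\[
B = D^{-1}(D - A) =
\begin{bmatrix}
0 & -a_{12}/a_{11} & \cdots & -a_{1n}/a_{11} \\
-a_{21}/a_{22} & 0 & \cdots & -a_{2n}/a_{22} \\
\vdots & & \ddots & \vdots \\
-a_{n1}/a_{nn} & -a_{n2}/a_{nn} & \cdots & 0
\end{bmatrix}. \tag{5.36}
\]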





The following result allows the verification of Proposition 5.2 without
explicitly computing $\rho(B)$:

Proposition 5.3 If the matrix A is strictly diagonally dominant by row,
then the Jacobi method converges.

As a matter of fact, we can verify that $\rho(B) < 1$, where B is given in
(5.36). To start with, we note that the diagonal elements of A are nonnull
owing to the strict diagonal dominance. Let $\lambda$ be a generic eigenvalue of
B and x an associated eigenvector. Then

\[
\sum_{j=1}^{n} b_{ij} x_j = \lambda x_i, \qquad i = 1,\dots,n.
\]

Assume for simplicity that $\max_{k=1,\dots,n} |x_k| = 1$ (this is not restrictive
since an eigenvector is defined up to a multiplicative constant) and let
$x_i$ be the component whose modulus is equal to 1. Then

\[
|\lambda| = \left| \sum_{j=1}^{n} b_{ij} x_j \right|
= \left| \sum_{j=1,\, j\neq i}^{n} b_{ij} x_j \right|
\leq \sum_{j=1,\, j\neq i}^{n} \left| \frac{a_{ij}}{a_{ii}} \right|,
\]

having noticed that B has only null diagonal elements. Therefore $|\lambda| < 1$
thanks to the assumption made on A.
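The bound used in this argument can be observed numerically. In the sketch below the matrix (strictly diagonally dominant by row) and the variable names are illustrative assumptions.

```matlab
% Jacobi iteration matrix (5.36) of a strictly diagonally dominant matrix
% (illustrative sketch).
A = [4 -1 1; 2 5 -1; 1 1 3];      % hypothetical matrix, strictly dominant by row
D = diag(diag(A));
B = D\(D - A);                    % B = D^{-1}(D - A), null diagonal
rho = max(abs(eig(B)));           % spectral radius of B
rowsum = max(sum(abs(B),2));      % max over i of sum_{j~=i} |a_ij/a_ii|
```

Here rowsum is smaller than 1 because of the strict diagonal dominance, and rho cannot exceed it.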

The Jacobi method is implemented in Program 8 by setting the
input parameter P='J'. Input parameters are: the system matrix A, the
right hand side b, the initial vector x0 and the maximum number of
iterations allotted, nmax. The iterative procedure is terminated as soon
as the ratio between the Euclidean norm of the current residual and
that of the initial residual is less than a prescribed tolerance tol (for a
justification of this stopping criterion, see Section 5.6).



Program 8 - itermeth: General iterative method

function [x, iter]= itermeth(A,b,x0,nmax,tol,P)
%ITERMETH General iterative method
% X = ITERMETH(A,B,X0,NMAX,TOL,P) attempts to solve the system of
% linear equations A*X=B for X. The N-by-N coefficient matrix A
% must be nonsingular and the right hand side column vector B must
% have length N. If P='J' the Jacobi method is used, if P='G' the
% Gauss-Seidel method is selected. Otherwise, P is a N-by-N matrix
% that plays the role of a preconditioner. TOL specifies the tolerance
% of the method. NMAX specifies the maximum number of iterations.
[n,n]=size(A);
if nargin == 6
  if ischar(P)==1
    if P=='J'
      L = diag(diag(A));
      U = eye(n); beta = 1; alpha = 1;
    elseif P == 'G'
      L = tril(A);
      U = eye(n); beta = 1; alpha = 1;
    end
  else
    [L,U]=lu(P); beta = 0;
  end
else
  L = eye(n); U = L; beta = 0;
end
iter = 0;
r = b - A * x0;
r0 = norm(r);
err = norm(r); x = x0;
while err > tol && iter < nmax
  iter = iter + 1;
  z = L\r;
  z = U\z;
  if beta == 0
    alpha = z'*r/(z'*A*z);
  end
  x = x + alpha*z;
  r = b - A * x;
  err = norm(r) / r0;
end



The Gauss-Seidel method 

When applying the Jacobi method, each individual component of the
new vector, say $x_i^{(k+1)}$, is computed independently of the others. This
may suggest that a faster convergence could be (hopefully) achieved if
the new components already available $x_j^{(k+1)}$, $j = 1,\dots,i-1$, together
with the old ones $x_j^{(k)}$, $j \geq i$, are used for the calculation of $x_i^{(k+1)}$.
This would lead to modifying (5.35) as follows: for $k \geq 0$ (still assuming
that $a_{ii} \neq 0$ for $i = 1,\dots,n$)

\[
x_i^{(k+1)} = \frac{1}{a_{ii}} \left( b_i - \sum_{j=1}^{i-1} a_{ij} x_j^{(k+1)} - \sum_{j=i+1}^{n} a_{ij} x_j^{(k)} \right), \qquad i = 1,\dots,n. \tag{5.37}
\]

The updating of the components is made in sequential mode, whereas
in the original Jacobi method it is made simultaneously (or in parallel).
The new method, which is called the Gauss-Seidel method, corresponds
to the choice P = D - E and $\alpha_k = 1$, $k \geq 0$, in (5.33), where E is a lower
triangular matrix whose non null entries are $e_{ij} = -a_{ij}$, $i = 2,\dots,n$,
$j = 1,\dots,i-1$. The corresponding iteration matrix is then
$B = (D - E)^{-1}(P - A)$.

A possible generalization is the so-called relaxation method in which
$P = \frac{1}{\omega} D - E$, where $\omega \neq 0$ is the relaxation parameter, and
$\alpha_k = 1$, $k \geq 0$ (see Exercise 5.13).

Also for the Gauss-Seidel method there exist special matrices A whose
associated iteration matrices satisfy the assumptions of Proposition 5.2
(those guaranteeing convergence). Among them let us mention:

1. matrices strictly diagonally dominant by row;

2. matrices which are symmetric and positive definite.

The Gauss-Seidel method is implemented in Program 8 by setting the
input parameter P equal to 'G'.

There are no general results stating that the Gauss-Seidel method 
converges faster than Jacobi’s. However, in some special instances this 
is the case, as stated by the following proposition: 

Proposition 5.4 Let A be a tridiagonal $n \times n$ nonsingular matrix whose
diagonal elements are all nonnull. Then the Jacobi method and the
Gauss-Seidel method are either both divergent or both convergent. In the
latter case, the Gauss-Seidel method is faster than Jacobi's; more precisely
the spectral radius of its iteration matrix is equal to the square of that of
Jacobi.
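The relation between the two spectral radii can be observed numerically on a tridiagonal matrix. The sketch below uses a 10 x 10 matrix with 3 on the main diagonal, -2 on the first upper diagonal and -1 on the first lower diagonal; the variable names are illustrative assumptions.

```matlab
% Spectral radii of the Jacobi and Gauss-Seidel iteration matrices for a
% tridiagonal matrix (illustrative sketch).
n = 10;
A = 3*eye(n) - 2*diag(ones(n-1,1),1) - diag(ones(n-1,1),-1);
D = diag(diag(A));
BJ = D\(D - A);                   % Jacobi:       P = D
PGS = tril(A);                    % Gauss-Seidel: P = D - E = tril(A)
BGS = PGS\(PGS - A);
rhoJ = max(abs(eig(BJ)));
rhoGS = max(abs(eig(BGS)));
```

Up to roundoff, rhoGS is expected to coincide with rhoJ^2, and both are smaller than 1.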

Example 5.11 Let us consider a linear system Ax = b where b is chosen in
such a way that the solution is the unit vector $(1,1,\dots,1)^T$ and A is the
10 x 10 tridiagonal matrix whose diagonal entries are all equal to 3, the
entries of the first upper diagonal are equal to -2 and those of the first lower
diagonal are all equal to -1. Both Jacobi and Gauss-Seidel methods converge
since the spectral radii of their iteration matrices are strictly less than 1.
More precisely, by starting from a null initial vector and setting
tol = $10^{-12}$, the Jacobi method converges in 277 iterations while only 143
iterations are required by Gauss-Seidel's. To get this result we have used the
following instructions:

>> n=10; A = 3*eye(n) - 2*diag(ones(n-1,1),1) - diag(ones(n-1,1),-1);
>> b=A*ones(n,1);
>> [x,iter]=itermeth(A,b,zeros(n,1),400,1.e-12,'J'); iter
iter =
   277
>> [x,iter]=itermeth(A,b,zeros(n,1),400,1.e-12,'G'); iter
iter =
   143

See Exercises 5.11-5.14.



5.6 When should an iterative method be stopped?



In theory iterative methods require an infinite number of iterations to 
converge to the exact solution. In practice, this is neither reasonable nor 
necessary. Indeed we do not really need to achieve the exact solution, 
but rather an approximation $\mathbf{x}^{(k)}$ for which we can guarantee that the
error is lower than a desired tolerance $\varepsilon$. On the other hand, since the
error is itself unknown (as it depends on the exact solution), we need 
a suitable a posteriori error estimator which predicts the error starting 
from quantities that have already been computed. 

The first type of estimator is represented by the residual which is
defined as

\[
\mathbf{r}^{(k)} = \mathbf{b} - A\mathbf{x}^{(k)},
\]

$\mathbf{x}^{(k)}$ being the approximate value of the solution at the k-th iteration.
More precisely, we could stop our iterative method at the first iteration
step $k_{min}$ for which

\[
\|\mathbf{r}^{(k_{min})}\| \leq \varepsilon \|\mathbf{b}\|.
\]

Setting $\hat{\mathbf{x}} = \mathbf{x}^{(k_{min})}$ and $\mathbf{r} = \mathbf{r}^{(k_{min})}$ in (5.25) we would obtain

\[
\frac{\|\mathbf{e}^{(k_{min})}\|}{\|\mathbf{x}\|} \leq \varepsilon K(A),
\]

which is an estimate for the relative error. We deduce that the control
on the residual is meaningful only for those matrices whose condition
number is reasonably small.



Example 5.12 Let us consider the linear system (5.1) where $A = A_{20}$ is the
Hilbert matrix of dimension 20 introduced in Example 5.7 and b is constructed
in such a way that the exact solution is $\mathbf{x} = (1,1,\dots,1)^T$. Since A is
symmetric and positive definite the Gauss-Seidel method surely converges. We
use Program 8 to solve this system taking x0 to be the null initial vector and
setting a tolerance on the residual equal to $10^{-5}$. The method converges in
472 iterations; however the relative error is very large and equals 0.26. This
is due to the fact that A is extremely ill conditioned, having
$K(A) \simeq 10^{17}$. In Figure 5.6 we show the behavior of the residual
(normalized to the initial one) and that of the error as the number of
iterations increases.

Fig. 5.6. Behavior of residual (dashed line) and error (solid line) for
Gauss-Seidel iterations applied to the system of Example 5.12



An alternative approach is based on the use of a different error
estimator, namely the increment $\boldsymbol{\delta}^{(k_{min})} = \mathbf{x}^{(k_{min}+1)} - \mathbf{x}^{(k_{min})}$.
More precisely, we can stop our iterative method at the first iteration
step $k_{min}$ for which

\[
\|\boldsymbol{\delta}^{(k_{min})}\| \leq \varepsilon \|\mathbf{b}\|.
\]

In the special case in which B is symmetric and positive definite, we have

\[
\|\mathbf{e}^{(k)}\| = \|\mathbf{e}^{(k+1)} + \boldsymbol{\delta}^{(k)}\| \leq \rho(B)\|\mathbf{e}^{(k)}\| + \|\boldsymbol{\delta}^{(k)}\|.
\]

Since $\rho(B)$ should be less than 1 in order for the method to converge, we
deduce

\[
\|\mathbf{e}^{(k)}\| \leq \frac{1}{1 - \rho(B)} \|\boldsymbol{\delta}^{(k)}\|. \tag{5.38}
\]

From the last inequality we see that the control on the increment is
meaningful only if $\rho(B)$ is much smaller than 1 since in that case the
error will be of the same size as the increment.

In fact, the same conclusion holds even if B is not symmetric and
positive definite (as it occurs for the Jacobi and Gauss-Seidel methods);
however in that case (5.38) is no longer true.

Example 5.13 Let us consider a system whose matrix $A \in \mathbb{R}^{50\times 50}$ is tridiagonal 
and symmetric with entries equal to 2.001 on the main diagonal and equal 
to 1 on the two other diagonals. As usual, the right hand side b is chosen in 
such a way that the unit vector $(1,\ldots,1)^T$ is the exact solution. Since A is 
tridiagonal with strict diagonal dominance, the Gauss-Seidel method will 
converge about twice as fast as the Jacobi method (in view of Proposition 5.4). 
Let us use Program 8 to solve our system in which we replace the stopping 
criterion based on the residual by that based on the increment. Using a null 
initial vector and setting the tolerance tol=$10^{-5}$, after 1604 iterations the 
program returns a solution whose error 0.0029 is quite large. The reason is 
that the spectral radius of the iteration matrix is equal to 0.9952, which is 
very close to 1. Should the diagonal entries be set equal to 3, after only 17 
iterations we would have obtained an error equal to $10^{-5}$. In fact in that case 
the spectral radius of the iteration matrix would be equal to 0.428. 
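The behavior described in Example 5.13 can be explored with the following minimal sketch. It is not Program 8: it is a self-contained Gauss-Seidel loop written for illustration, the names P, incr and rho are our own, and the test on the increment is assumed to be $\|\boldsymbol{\delta}^{(k)}\| \le$ tol$\,\|\mathbf{b}\|$, so the iteration count need not coincide with the one quoted in the example.

```matlab
% Sketch: Gauss-Seidel with a stopping test on the increment (illustrative only).
n = 50;
A = 2.001*eye(n) + diag(ones(n-1,1),1) + diag(ones(n-1,1),-1);
b = A*ones(n,1);              % right hand side such that the solution is (1,...,1)^T
tol = 1.e-5; nmax = 10000;
P = tril(A);                  % Gauss-Seidel: lower triangular part of A
x = zeros(n,1); iter = 0; incr = Inf;
while incr > tol*norm(b) && iter < nmax
    xnew = x + P\(b - A*x);   % one Gauss-Seidel step
    incr = norm(xnew - x);    % norm of the increment, used as error estimator
    x = xnew; iter = iter + 1;
end
err = norm(x - ones(n,1));    % true error, available here because the solution is known
B = eye(n) - P\A;             % iteration matrix
rho = max(abs(eig(B)));       % its spectral radius
fprintf('iter = %d, error = %g, rho(B) = %g\n', iter, err, rho);
```

Comparing err with incr at exit shows how much the increment underestimates the error when rho is close to 1; replacing 2.001 by 3 on the diagonal makes the two quantities comparable.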



5.7 Richardson method 



Let us now consider methods (5.33) for which the acceleration parameters 
$\alpha_k$ are non-null. We call stationary the case when $\alpha_k = \alpha$ (a given 
constant) for any $k \ge 0$, dynamic the case in which $\alpha_k$ may change along 
the iterations. In this framework the non singular matrix P is still called 
a preconditioner of A. 

The crucial issue is the way the parameters are chosen. In this respect, 
the following result holds (see, e.g., [QV94, Chap. 2], [Axe94]). 



Proposition 5.5 If both P and A are symmetric and positive definite, 
the stationary Richardson method converges for every possible choice of 
$\mathbf{x}^{(0)}$ iff $0 < \alpha < 2/\lambda_{max}$, where $\lambda_{max}$ (> 0) is the maximum eigenvalue 
of $P^{-1}A$. Moreover, the spectral radius $\rho(B_\alpha)$ of the iteration matrix 
$B_\alpha = I - \alpha P^{-1}A$ is least when $\alpha = \alpha_{opt}$, where 

$\alpha_{opt} = \frac{2}{\lambda_{min} + \lambda_{max}},$    (5.39) 

$\lambda_{min}$ being the minimum eigenvalue of $P^{-1}A$. 
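As a quick numerical illustration of (5.39), the sketch below computes the optimal parameter for a small symmetric positive definite matrix. The matrix, the diagonal preconditioner and the names lam, rho and K are our own illustrative choices; K denotes here the ratio between the extreme eigenvalues of $P^{-1}A$.

```matlab
% Sketch: optimal parameter of the stationary Richardson method (illustrative data).
A = [2 1; 1 3];                      % symmetric positive definite matrix
P = diag(diag(A));                   % assumed preconditioner: the diagonal of A
lam = real(eig(P\A));                % eigenvalues of P^{-1}A (real and positive here)
lmin = min(lam); lmax = max(lam);
alpha_opt = 2/(lmin + lmax);         % formula (5.39)
rho = @(a) max(abs(1 - a*lam));      % spectral radius of B_alpha = I - a*P^{-1}A
K = lmax/lmin;                       % ratio of the extreme eigenvalues
fprintf('alpha_opt = %g, rho = %g, (K-1)/(K+1) = %g\n', ...
        alpha_opt, rho(alpha_opt), (K-1)/(K+1));
```

Evaluating rho at values slightly smaller or larger than alpha_opt gives a larger spectral radius, in agreement with the proposition.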

Under the same assumption on P and A, the dynamic Richardson 
method converges if for instance $\alpha_k$ is chosen in the following way: 

$\alpha_k = \frac{(\mathbf{z}^{(k)})^T \mathbf{r}^{(k)}}{(\mathbf{z}^{(k)})^T A \mathbf{z}^{(k)}} \quad \forall k \ge 0,$ 

where $\mathbf{z}^{(k)} = P^{-1}\mathbf{r}^{(k)}$ is the preconditioned residual. The method (5.33) 
with this choice of $\alpha_k$ is called the preconditioned gradient method, or 
simply the gradient method when the preconditioner P is the identity 
matrix. 

In both cases, the following convergence estimate holds: 

$\|\mathbf{e}^{(k)}\|_A \le \left(\frac{K(P^{-1}A) - 1}{K(P^{-1}A) + 1}\right)^k \|\mathbf{e}^{(0)}\|_A, \quad k \ge 0,$    (5.40) 

where $\|\mathbf{v}\|_A = \sqrt{\mathbf{v}^T A \mathbf{v}}$, $\forall \mathbf{v} \in \mathbb{R}^n$, is the so-called energy norm 
associated with the matrix A. 

The dynamic version should therefore be preferred to the stationary 
one since it does not require the knowledge of the extreme eigenvalues 
of $P^{-1}A$. Rather, the parameter $\alpha_k$ is determined in terms of quantities 
which are already available from the previous iteration. 

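We can rewrite the preconditioned gradient method more efficiently 
through the following algorithm (derivation is left as an exercise): given 
$\mathbf{x}^{(0)}$, for every $k \ge 0$ do 

$P\mathbf{z}^{(k)} = \mathbf{r}^{(k)},$ 
$\alpha_k = \frac{(\mathbf{z}^{(k)})^T \mathbf{r}^{(k)}}{(\mathbf{z}^{(k)})^T A \mathbf{z}^{(k)}},$ 
$\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} + \alpha_k \mathbf{z}^{(k)},$ 
$\mathbf{r}^{(k+1)} = \mathbf{r}^{(k)} - \alpha_k A\mathbf{z}^{(k)}.$    (5.41) 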



The same algorithm can be used to implement the stationary Richardson 
method by simply replacing $\alpha_k$ with the constant value $\alpha$. 

From (5.40), we deduce that if $P^{-1}A$ is ill conditioned the convergence 
rate will be very low even for $\alpha = \alpha_{opt}$ (as in that case $\rho(B_{\alpha_{opt}}) \simeq 1$). 
This circumstance can be avoided provided that a convenient choice of 
P is made. This is the reason why P is called the preconditioner or the 
preconditioning matrix. 

If A is a generic matrix it may be a difficult task to find a precon- 
ditioner which guarantees an optimal trade-off between damping the 
condition number and keeping the computational cost for the solution 
of the system (5.34) reasonably low. 

The dynamic Richardson method is implemented in Program 8 where 
the input parameter P stands for the preconditioning matrix (when not 
prescribed, the program implements the unpreconditioned method by 
setting P=I). 
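The steps of algorithm (5.41) can be sketched as follows. This is a minimal illustration and not the listing of the program mentioned above: the test system, the diagonal preconditioner, the initial vector and the stopping test on the relative residual are assumptions made only for this example.

```matlab
% Sketch of the preconditioned gradient method (5.41) on a small SPD system.
A = [2 1; 1 3]; b = [1; 0];        % illustrative symmetric positive definite system
P = diag(diag(A));                 % assumed preconditioner: the diagonal of A
x = [1; 1/2];                      % initial vector
tol = 1.e-10; nmax = 1000;
r = b - A*x; r0 = norm(r); iter = 0;
while norm(r) > tol*r0 && iter < nmax
    z = P\r;                       % solve P z = r (preconditioned residual)
    Az = A*z;
    alpha = (z'*r)/(z'*Az);        % dynamic acceleration parameter alpha_k
    x = x + alpha*z;               % new iterate
    r = r - alpha*Az;              % residual update, reusing the product A*z
    iter = iter + 1;
end
fprintf('iter = %d, x = (%g, %g)\n', iter, x(1), x(2));
```

Replacing alpha inside the loop by a fixed value between 0 and $2/\lambda_{max}$ turns the same loop into the stationary Richardson method.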

Example 5.14 This example, of theoretical interest only, has the purpose 
of comparing the convergence behavior of Jacobi, Gauss-Seidel and gradient 
methods applied to solve the following (micro) linear system: 

$2x_1 + x_2 = 1, \quad x_1 + 3x_2 = 0$    (5.42) 

with initial vector $\mathbf{x}^{(0)} = (1, 1/2)^T$. Note that the system matrix is symmetric 
and positive definite, and that the exact solution is $\mathbf{x} = (3/5, -1/5)^T$. We 
report in Figure 5.7 the behavior of the relative residual $E^{(k)} = \|\mathbf{r}^{(k)}\|/\|\mathbf{r}^{(0)}\|$ 
for the three methods above. Iterations are stopped at the first iteration $k_{min}$ 
for which $E^{(k_{min})} < 10^{-14}$. The gradient method appears to converge the 
fastest. 




Fig. 5.7. Convergence history for Jacobi, Gauss-Seidel and gradient methods 
applied to system (5.42) 



Example 5.15 Let us consider a system Ax = b where $A \in \mathbb{R}^{100\times 100}$ is a 
pentadiagonal matrix whose main diagonal has all entries equal to 4, while 
the first and third lower and upper diagonals have all entries equal to -1. As 
customary, b is chosen in such a way that $\mathbf{x} = (1,\ldots,1)^T$ is the exact solution 
of our system. Let us use Program 8 which implements the preconditioned 
Richardson method. We fix tol=1.e-05, nmax=5000, x0=zeros(100,1) and 
the preconditioner P is a tridiagonal matrix whose diagonal elements are all 
equal to 2, while the elements on the lower and upper diagonal are all equal to 
-1. Both A and P are symmetric and positive definite. The method converges 
in 18 iterations, whereas Program 8 (implementing the Gauss-Seidel method) 
requires as many as 2421 iterations before fulfilling the same stopping criterion. 

Example 5.16 (Direct and iterative methods) Let us go back to Example 
5.7 on the Hilbert matrix and solve the system (for different values of n) 
by the preconditioned gradient method using the diagonal preconditioner D, 
made of the diagonal entries of the Hilbert matrix. We report also the results 
that would be obtained using a more efficient iterative method, the conjugate 
gradient method (CG). The CG method will be addressed in Section 5.8. We 
fix $\mathbf{x}^{(0)}$ to be the null vector and iterate until the relative residual is less than 
$10^{-6}$. In Table 5.1 we report the absolute errors (with respect to the exact 
solution) that are obtained with the above iterative methods and those that 
would be obtained by using the Gauss LU factorization approach. The error 
degenerates when n gets large and the direct method is used. On the other 
hand, we can appreciate the beneficial effect that a suitable iterative method, 
such as the CG scheme, can have on the number of iterations. 





                   LU          PG                  PCG 
  n    K(A)        Error       Error      Iter.    Error      Iter. 
  4    1.55e+04    3.90e-13    1.74e-02    995     2.24e-02    3 
  6    1.50e+07    2.62e-10    8.80e-03   1813     9.50e-03    9 
  8    1.53e+10    1.34e-07    1.78e-02   1089     2.13e-02    4 
 10    1.60e+13    7.60e-04    2.52e-03    875     6.98e-03    5 
 12    1.67e+16    4.97e-01    1.76e-02   1355     1.12e-02    5 
 14    3.04e+17    6.65e+00    1.46e-02   1379     1.61e-02    5 



Tab. 5.1. Errors obtained using the preconditioned gradient method (PG), 
the preconditioned conjugate gradient method (PCG) and the Gauss LU fac- 
torization method for the solution of the Hilbert system 



See the Exercises 5.15-5.17. 



Let us summarize 



1. An iterative method for the solution of a linear system starts from 
a given initial vector $\mathbf{x}^{(0)}$ and builds up a sequence of vectors $\mathbf{x}^{(k)}$ 
which we require to converge to the exact solution as $k \to \infty$; 

2. an iterative method converges for every possible choice of the initial 
vector $\mathbf{x}^{(0)}$ iff the spectral radius of the iteration matrix is strictly 
less than 1; 

3. classical iterative methods are those of Jacobi and Gauss-Seidel. 
A sufficient condition for convergence is that the system matrix 
be strictly diagonally dominant by row (or symmetric and positive 
definite in the case of Gauss-Seidel); 

4. in the Richardson method convergence is accelerated thanks to the 
introduction of a parameter and (possibly) a convenient precondi- 
tioning matrix; 

5. there are two possible stopping criteria for an iterative method: 
controlling the residual or controlling the increment. The former 
is meaningful if the system matrix is well conditioned, the latter if 
the spectral radius of the iteration matrix is not close to 1. 



5.8 What we haven't told you 



Several efficient variants of the Gauss LU factorization are available for 
sparse systems of large dimension. Among the most advanced, we quote 
the so-called multifrontal method which makes use of a suitable reordering 
of the system unknowns in order to keep the triangular factors L and 
U as sparse as possible. More on this issue is available in [GL89], [DD95] 
and [Iro70]. 

Concerning iterative methods, there exists a broad family of modern 
methods, the so-called Krylov methods, which are more efficient than 
those presented above. Some of them feature the notable property of 
finite termination, that is, in exact arithmetic they provide the exact 
solution in a finite number of iterations. We can mention the conjugate 
gradient method (which can be applied if the system matrix is symmetric 
and positive definite) and the GMRES (Generalized Minimum RESidual) 
method. Their description is provided, e.g., in [Axe94] and [Saa96]. They 
are available in the MATLAB toolbox sparfun under the name of pcg 
and gmres. 

As already pointed out, iterative methods converge slowly if the system 
matrix is severely ill conditioned. Several preconditioning strategies have 
been developed (see, e.g., [dV89]). Some of them are purely algebraic, 
that is, they are based on incomplete (or inexact) factorizations of the 
given system matrix, and are implemented in the MATLAB functions 
luinc or cholinc. Other strategies are developed ad hoc by exploiting 
the meaning and the structure of the problem which has generated the 
linear system at hand. 
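A minimal sketch of how a Krylov method can be combined with an incomplete factorization is given below. It assumes a MATLAB release in which the incomplete Cholesky factorization is provided by ichol (to the best of our knowledge, recent releases supply ichol and ilu in place of cholinc and luinc); the test matrix is purely illustrative.

```matlab
% Sketch: conjugate gradient method preconditioned by an incomplete Cholesky factor.
n = 100;
e = ones(n,1);
A = spdiags([-e 2*e -e], -1:1, n, n);   % sparse symmetric positive definite matrix
b = A*e;                                % right hand side with exact solution (1,...,1)^T
L = ichol(A);                           % incomplete Cholesky factor, A ~ L*L'
[x, flag, relres, iter] = pcg(A, b, 1.e-10, 200, L, L');
fprintf('flag = %d, iterations = %d, error = %g\n', flag, iter, norm(x - e));
```

A value flag equal to 0 indicates that pcg reached the requested tolerance within the maximum number of iterations.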

Finally it is worthwhile mentioning the multigrid algorithms which 
are based on the sequential use of a hierarchy of systems of variable 
dimensions that “resemble” the original one, allowing a clever strategy 
of reduction of the error (see, e.g., [Hac85], [Wes91] and [Hac94]). 



5.9 Exercises 



Exercise 5.1 For a given matrix $A \in \mathbb{R}^{n\times n}$ find the number of operations (as 
a function of n) that are needed for computing its determinant by the recursive 
formula (1.8). 

Exercise 5.2 Use the MATLAB command magic(n), n=3,4,...,500, to construct 
the magic squares of order n, that is, those matrices having entries for 
which the sum of the elements by rows, columns or diagonals are identical. 
Then compute their determinants by the command det introduced in Section 
1.3 and the CPU time that is needed for this computation using the command 
cputime. Finally, approximate these data by the least-square method 
and deduce that the CPU time scales approximately as $n^3$. 

Exercise 5.3 Find for which values of $\varepsilon$ the matrix defined in (5.12) does not 
satisfy the hypotheses of Proposition 5.1. For which value of $\varepsilon$ does this matrix 
become singular? Is it possible to compute the LU factorization in that case? 

Exercise 5.4 Verify that the number of operations necessary to compute the 
LU factorization of a square matrix A of dimension n is approximately $2n^3/3$. 

Exercise 5.5 Show that the LU factorization of A can be used for computing 
the inverse matrix $A^{-1}$. (Observe that the j-th column vector $\mathbf{y}_j$ of $A^{-1}$ satisfies 
the linear system $A\mathbf{y}_j = \mathbf{e}_j$, $\mathbf{e}_j$ being the vector whose components are all null 
except the j-th component which is 1.) 

Exercise 5.6 Compute the factors L and U of the matrix of Example 5.6 and 
verify that the LU factorization is inaccurate. 

Exercise 5.7 Explain why partial pivoting by row is not convenient for sym- 
metric matrices. 



Exercise 5.8 Consider the linear system Ax = b with 

$A = \begin{bmatrix} 2 & -2 & 0 \\ \varepsilon-2 & 2 & 0 \\ 0 & -1 & 3 \end{bmatrix}$ 

and b such that the corresponding solution is $\mathbf{x} = (1,1,1)^T$ and $\varepsilon$ is a positive 
real number. Compute the Gauss factorization of A and note that $l_{32} \to \infty$ 
when $\varepsilon \to 0$. In spite of that, verify that the computed solution is accurate. 



Exercise 5.9 Consider the linear systems $A_i\mathbf{x}_i = \mathbf{b}_i$, i = 1,2,3, with 

$A_1 = \begin{bmatrix} 15 & 6 & 8 & 11 \\ 6 & 6 & 5 & 3 \\ 8 & 5 & 7 & 6 \\ 11 & 3 & 6 & 9 \end{bmatrix}, \quad A_i = (A_1)^i, \; i = 2,3,$ 

and $\mathbf{b}_i$ such that the solution is always $\mathbf{x}_i = (1,1,1,1)^T$. Solve the system by 
the Gauss factorization using partial pivoting by row, and comment on the 
obtained results. 

Exercise 5.10 Show that for a symmetric and positive definite matrix A we 
have $K(A^2) = (K(A))^2$. 

Exercise 5.11 Analyse the convergence properties of the Jacobi and Gauss-Seidel 
methods for the solution of a linear system whose matrix is 

$A = \begin{bmatrix} \alpha & 0 & 1 \\ 0 & \alpha & 0 \\ 1 & 0 & \alpha \end{bmatrix}, \quad \alpha \in \mathbb{R}.$ 



Exercise 5.12 Provide a sufficient condition on $\beta$ so that both the Jacobi 
and Gauss-Seidel methods converge when applied for the solution of a system 
whose matrix is 

$A = \begin{bmatrix} -10 & 2 \\ \beta & 5 \end{bmatrix}.$    (5.43) 



Exercise 5.13 For the solution of the linear system Ax = b with $A \in \mathbb{R}^{n\times n}$, 
consider the relaxation method: given $\mathbf{x}^{(0)} = (x_1^{(0)}, \ldots, x_n^{(0)})^T$, for k = 0, 1, ... 
compute 

$r_i^{(k)} = b_i - \sum_{j=1}^{i-1} a_{ij} x_j^{(k+1)} - \sum_{j=i+1}^{n} a_{ij} x_j^{(k)}, \quad x_i^{(k+1)} = (1-\omega) x_i^{(k)} + \omega \frac{r_i^{(k)}}{a_{ii}},$ 

for i = 1, ..., n, where $\omega$ is a real parameter. Find the explicit form of the 
corresponding iterative matrix, then verify that the condition $0 < \omega < 2$ is 
necessary for the convergence of this method. Note that if $\omega = 1$ this method 
reduces to the Gauss-Seidel method. If $1 < \omega < 2$ the method is known as 
SOR (successive over-relaxation). 



Exercise 5.14 Consider the linear system Ax = b with 

$A = \begin{bmatrix} 3 & 2 \\ 2 & 6 \end{bmatrix}$ 

and say whether the Gauss-Seidel method converges, without explicitly computing 
the spectral radius of the iteration matrix. 



Exercise 5.15 Compute the first iteration of the Jacobi, Gauss-Seidel and 
preconditioned gradient method (with preconditioner given by the diagonal of 
A) for the solution of system (5.42). 



Exercise 5.16 Prove (5.39), then show that 

$\rho(B_{\alpha_{opt}}) = \frac{\lambda_{max} - \lambda_{min}}{\lambda_{max} + \lambda_{min}} = \frac{K(P^{-1}A) - 1}{K(P^{-1}A) + 1}.$    (5.44) 



Exercise 5.17 Let us consider a set of n = 20 factories which produce 20 
different goods. With reference to the Leontief model introduced in Problem 
5.3, suppose that the matrix C has the following integer entries: $c_{ij} = i + j - 1$ 
for i, j = 1, ..., n, while $b_i = i$, for i = 1, ..., 20. Is it possible to solve this 
system by the gradient method? Propose a method based on the gradient 
method noting that, if A is non-singular, the matrix $A^TA$ is symmetric and 
positive definite. 



Eigenvalues and eigenvectors 



We consider the following problem: given a square matrix $A \in \mathbb{R}^{n\times n}$, 
find a scalar $\lambda$ (real or complex) and a non-null vector x such that 

$A\mathbf{x} = \lambda\mathbf{x}.$    (6.1) 

Any such $\lambda$ is called an eigenvalue of A, while x is the associated eigenvector. 
The latter is not unique; indeed all its multiples $\alpha\mathbf{x}$ with $\alpha \ne 0$, 
real or complex, are also eigenvectors associated with $\lambda$. Should x be 
available, $\lambda$ can be recovered by using the Rayleigh quotient $\mathbf{x}^*A\mathbf{x}/\|\mathbf{x}\|^2$, 
$\mathbf{x}^*$ being the vector whose i-th component is equal to $\bar{x}_i$. 

A number $\lambda$ is an eigenvalue of A if it is a root of the following polynomial 
of degree n (called the characteristic polynomial of A): 

$p_A(\lambda) = \det(A - \lambda I).$ 

Consequently, a square matrix of dimension n has exactly n eigenvalues 
(real or complex), not necessarily distinct. Also, if A has real 
entries, $p_A(\lambda)$ has real coefficients, and therefore complex eigenvalues of 
A necessarily occur in complex conjugate pairs. 
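The definitions above can be checked numerically with a few standard MATLAB commands. The sketch below uses an illustrative 2 x 2 matrix of our own choosing; note that poly returns the coefficients of $\det(\lambda I - A)$, which differs from $p_A(\lambda)$ at most by the sign $(-1)^n$ and therefore has the same roots.

```matlab
% Sketch: eigenvalues, Rayleigh quotient and characteristic polynomial.
A = [4 1; 2 3];                    % illustrative matrix
[V, D] = eig(A);                   % columns of V: eigenvectors; diagonal of D: eigenvalues
x = V(:,1); lambda = D(1,1);       % one eigenpair
rq = (x'*A*x)/(x'*x);              % Rayleigh quotient x^*Ax/||x||^2
c = poly(A);                       % coefficients of det(lambda*I - A)
r = sort(roots(c));                % roots of the characteristic polynomial
fprintf('lambda = %g, Rayleigh quotient = %g\n', lambda, rq);
```

For matrices of large dimension the roots of the characteristic polynomial should not be used to compute eigenvalues; the example only illustrates the definition.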

Problem 6.1 (Dynamics) Consider the system of Figure 6.1 made of two 
pointwise bodies $P_1$ and $P_2$ of mass m, connected by 2 springs and free to 
move along the line joining $P_1$ and $P_2$. Let $x_i(t)$ denote the position occupied 
by $P_i$ at time t for i = 1, 2. Then from the second law of dynamics we obtain 

$m\ddot{x}_1 = K(x_2 - x_1) - Kx_1, \quad m\ddot{x}_2 = K(x_1 - x_2).$ 

We are interested in forced oscillations whose corresponding solution is $x_i = 
a_i \sin(\omega t + \phi)$, i = 1, 2, with $a_i \ne 0$. In this case we find that 

$-ma_1\omega^2 = K(a_2 - a_1) - Ka_1,$ 
$-ma_2\omega^2 = K(a_1 - a_2).$    (6.2) 

This is a 2 x 2 homogeneous system which has a non-trivial solution $a_1$, $a_2$ iff 
the number $\lambda = m\omega^2/K$ is an eigenvalue of the matrix 

$A = \begin{bmatrix} 2 & -1 \\ -1 & 1 \end{bmatrix}.$ 

With this definition of $\lambda$, (6.2) becomes $A\mathbf{a} = \lambda\mathbf{a}$. Since $p_A(\lambda) = (2 - \lambda)(1 - 
\lambda) - 1$, the two eigenvalues are $\lambda_1 \simeq 2.618$ and $\lambda_2 \simeq 0.382$ and correspond 
to the frequencies of oscillation $\omega_i = \sqrt{K\lambda_i/m}$ which are admitted by our 
system. • 
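The two eigenvalues of Problem 6.1 can be verified with the sketch below. It takes the matrix with rows (2, -1) and (-1, 1), which is consistent with the characteristic polynomial $(2-\lambda)(1-\lambda)-1$ given above; the unit values of the spring constant K and of the mass m are hypothetical and serve only to print the frequencies.

```matlab
% Sketch: eigenvalues and oscillation frequencies for the two-body system.
A = [2 -1; -1 1];
lambda = sort(eig(A), 'descend');  % expected: (3+sqrt(5))/2 and (3-sqrt(5))/2
K = 1; m = 1;                      % hypothetical spring constant and mass
omega = sqrt(K*lambda/m);          % admissible frequencies of oscillation
fprintf('lambda = %.4f, %.4f; omega = %.4f, %.4f\n', lambda, omega);
```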








Fig. 6.1. The system of two pointwise bodies of equal mass connected by 
springs 



Problem 6.2 (Demography) Several mathematical models have been proposed 
in order to predict the evolution of certain species (either human or 
animal). The simplest population model, which was introduced in 1920 by 
Lotka and formalized by Leslie 20 years later, is based on the rate of mortality 
and fecundity for different age intervals, say i = 0, ..., n. Let $x_i^{(t)}$ denote the 
number of females (males don't matter in this context) whose age at time t 
falls in the i-th interval. The values of $x_i^{(0)}$ are given. Moreover, let $s_i$ denote 
the rate of survival of the females belonging to the i-th interval, and $m_i$ the 
average number of females generated from a female in the i-th interval. 

The model by Lotka and Leslie is described by the set of equations 

$x_{i+1}^{(t+1)} = x_i^{(t)} s_i, \quad i = 0, \ldots, n-1,$ 
$x_0^{(t+1)} = \sum_{i=0}^{n} x_i^{(t)} m_i.$ 

The first n equations describe the population development, the last its 
reproduction. In matrix form we have 

$\mathbf{x}^{(t+1)} = A\mathbf{x}^{(t)},$ 

where $\mathbf{x}^{(t)} = (x_0^{(t)}, \ldots, x_n^{(t)})^T$ and A is the Leslie matrix: 

$A = \begin{bmatrix} m_0 & m_1 & \cdots & \cdots & m_n \\ s_0 & 0 & \cdots & \cdots & 0 \\ 0 & s_1 & \ddots & & \vdots \\ \vdots & \ddots & \ddots & \ddots & \vdots \\ 0 & 0 & 0 & s_{n-1} & 0 \end{bmatrix}.$ 

We will see in Section 6.1 that the dynamics of this population is determined 
by the eigenvalue $\lambda_1$ of A of maximum modulus, whereas the distribution of 
the individuals in the different age intervals (normalized with respect to the 
whole population), is obtained as the limit of $\mathbf{x}^{(t)}$ for $t \to \infty$ and satisfies 
$A\mathbf{x} = \lambda_1\mathbf{x}$. This problem will be solved in Exercise 6.2. • 
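The structure of the Leslie matrix and the role of its eigenvalue of maximum modulus can be illustrated by the following sketch. The fecundities m, the survival rates s and the initial population are hypothetical numbers chosen only for the example; they are not the data of Exercise 6.2.

```matlab
% Sketch: Leslie matrix and long-time age distribution (hypothetical data).
m = [0 1.5 1.0 0.2];               % fecundities m_0,...,m_n (assumed values)
s = [0.6 0.8 0.5];                 % survival rates s_0,...,s_{n-1} (assumed values)
nint = numel(m);                   % number of age intervals
A = [m; diag(s), zeros(nint-1,1)]; % first row: reproduction; subdiagonal: survival
x = [100; 0; 0; 0];                % initial population
for t = 1:200
    x = A*x;                       % one time step of the model
    x = x/sum(x);                  % normalize with respect to the whole population
end
lambda1 = max(abs(eig(A)));        % eigenvalue of maximum modulus
fprintf('lambda1 = %g, residual = %g\n', lambda1, norm(A*x - lambda1*x));
```

The normalized vector x approaches an eigenvector associated with lambda1, so the printed residual is close to zero.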



Problem 6.3 (Interurban viability) Let us consider n cities and let A be 
a matrix whose entry $a_{ij}$ is equal to 1 if the i-th city is connected directly 
to the j-th city, and 0 otherwise. One can show that the components of the 
eigenvector x (of unit length) associated with the maximum eigenvalue provide 
the accessibility rate (which is a measure of the ease of access) to the 
various cities. Consider for instance the railway network connecting the 11 
main cities of Lombardy in Northern Italy, see Figure 6.2. The moduli of the 
components of x in this case are 

 0.5271 (Milan)      0.1590 (Pavia) 
 0.2165 (Lodi)       0.3579 (Brescia) 
 0.4690 (Bergamo)    0.3861 (Como) 
 0.1590 (Varese)     0.2837 (Lecco) 
 0.0856 (Sondrio)    0.1906 (Cremona) 
 0.0575 (Mantua) 

We can see that the best connected cities are (in decreasing order) Milan, 
Bergamo, Como and Brescia, while Mantua is the worst connected. Note that 
in spite of the fact that Pavia, Varese, Mantua and Sondrio have the same 
number of accesses (one), their accessibility rate is different. Obviously, this 
analysis takes no account of the frequency of the connections but 
merely of the existence of the link between the different cities. • 



In the special case where A is either diagonal or triangular, its eigenvalues 
are nothing but its diagonal entries. However, if A is a general matrix 
and its dimension n is sufficiently large, seeking the zeros of $p_A(\lambda)$ is not 
a convenient approach. Ad hoc algorithms are better suited, and one of 
them is described in the next section. 

 1 Milan      2 Pavia      3 Lodi       4 Brescia 
 5 Bergamo    6 Como       7 Varese     8 Lecco 
 9 Sondrio   10 Cremona   11 Mantua 



Fig. 6.2. A schematic representation of the railway network between the main 
cities of Lombardy 



6.1 The power method 



As noticed in Problems 6.2 and 6.3, the knowledge of the whole spectrum 
of A (that is, the set of all its eigenvalues) is not always required. 
Often, only the extremal eigenvalues matter, that is, those having largest 
and smallest modulus. 

For instance, suppose that A is a square matrix of dimension n, with 
real entries, and assume that its eigenvalues can be ordered as 

$|\lambda_1| > |\lambda_2| \ge |\lambda_3| \ge \ldots \ge |\lambda_n|.$    (6.3) 

Then, in particular, $\lambda_1$ is distinct from the other eigenvalues of A. Let 
us indicate by $\mathbf{x}_1$ the eigenvector (with unit length) associated with 
$\lambda_1$. If the eigenvectors of A are linearly independent, $\lambda_1$ and $\mathbf{x}_1$ can be 
computed by the following iterative procedure, commonly known as the 
power method: 

given an arbitrary initial vector $\mathbf{x}^{(0)}$ and setting $\mathbf{y}^{(0)} = \mathbf{x}^{(0)}/\|\mathbf{x}^{(0)}\|$, 
compute for $k \ge 1$ 

$\mathbf{x}^{(k)} = A\mathbf{y}^{(k-1)}, \quad \mathbf{y}^{(k)} = \frac{\mathbf{x}^{(k)}}{\|\mathbf{x}^{(k)}\|}, \quad \lambda^{(k)} = (\mathbf{y}^{(k)})^T A\mathbf{y}^{(k)}.$    (6.4) 



Note that, by recursion, one finds $\mathbf{y}^{(k)} = \beta^{(k)} A^k \mathbf{y}^{(0)}$ where $\beta^{(k)} = 
(\prod_{i=1}^{k} \|\mathbf{x}^{(i)}\|)^{-1}$ for $k \ge 1$. The presence of the powers of A justifies the 
name given to this method. 

As we will see in the next section, this method generates a sequence of 
vectors $\{\mathbf{y}^{(k)}\}$ with unit length which, as $k \to \infty$, align themselves along 
the direction of the eigenvector $\mathbf{x}_1$. The error $\|\mathbf{y}^{(k)} - \mathbf{x}_1\|$ decays like 
$|\lambda_2/\lambda_1|^k$ in the case of a generic matrix; when the matrix A is symmetric, 
the error $|\lambda^{(k)} - \lambda_1|$ on the eigenvalue decays like $|\lambda_2/\lambda_1|^{2k}$. 
Consequently one obtains that $\lambda^{(k)} \to \lambda_1$ for $k \to \infty$. 

An implementation of the power method is given in the Program 9. 
The iterative procedure is stopped at the first iteration k when 

$|\lambda^{(k)} - \lambda^{(k-1)}| < \varepsilon |\lambda^{(k)}|,$ 

where $\varepsilon$ is a desired tolerance. The input parameters are the matrix A, 
the initial vector x0, the tolerance tol for the stopping test and the 
maximum admissible number of iterations nmax. Output parameters are 
the maximum modulus eigenvalue lambda, the associated eigenvector 
and the actual number of iterations which have been carried out. 



Program 9 - eigpower: power method 

function [lambda,x,iter]=eigpower(A,tol,nmax,x0) 
%EIGPOWER Numerically evaluate one eigenvalue of a matrix. 
%   LAMBDA = EIGPOWER(A) computes with the power method the 
%   eigenvalue of A of maximum modulus from an initial guess which by default 
%   is an all one vector. 
%   LAMBDA = EIGPOWER(A,TOL,NMAX,X0) uses an absolute error 
%   tolerance TOL (the default is 1.e-6) and a maximum number of iterations 
%   NMAX (the default is 100), starting from the initial vector X0. 
%   [LAMBDA,V,ITER] = EIGPOWER(A,TOL,NMAX,X0) also returns the eigenvector 
%   V such that A*V=LAMBDA*V and the iteration number at which V was computed. 
[n,m] = size(A); 
if n ~= m, error('Only for square matrices'); end 
if nargin == 1 
   tol = 1.e-06; 
   x0 = ones(n,1); 
   nmax = 100; 
end 
x0 = x0/norm(x0); 
pro = A*x0; 
lambda = x0'*pro; 
err = tol*abs(lambda) + 1; 
iter = 0; 
while err > tol*abs(lambda) & abs(lambda) ~= 0 & iter <= nmax 
   x = pro; x = x/norm(x); 
   pro = A*x; lambdanew = x'*pro; 
   err = abs(lambdanew - lambda); 
   lambda = lambdanew; 
   iter = iter + 1; 
end 



Example 6.1 Consider the family of matrices 

$A(\alpha) = \begin{bmatrix} \alpha & 2 & 3 & 13 \\ 5 & 11 & 10 & 8 \\ 9 & 7 & 6 & 12 \\ 4 & 14 & 15 & 1 \end{bmatrix}, \quad \alpha \in \mathbb{R}.$ 

We want to approximate the eigenvalue with largest modulus by the power 
method. When $\alpha = 30$, the eigenvalues of the matrix are given by $\lambda_1 = 39.396$, 
$\lambda_2 = 17.8208$, $\lambda_3 = -9.5022$ and $\lambda_4 = 0.2854$ (only the first four significant 
digits are reported). The method approximates $\lambda_1$ in 22 iterations with a 
tolerance $\varepsilon = 10^{-10}$ and $\mathbf{x}^{(0)} = \mathbf{1}^T$. However, if $\alpha = -30$ we need as many as 
708 iterations. 

The different behavior can be explained by noting that in the latter case 
one has $\lambda_1 = -30.643$, $\lambda_2 = 29.7359$, $\lambda_3 = -11.6806$ and $\lambda_4 = 0.5878$. Thus, 
$|\lambda_2|/|\lambda_1| = 0.9704$, close to unity. 
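The dependence of the iteration count on the ratio $|\lambda_2|/|\lambda_1|$ can be observed with the sketch below. It uses a plain power iteration written inline, not Program 9, so the number of iterations it prints is only indicative; the variable names are our own.

```matlab
% Sketch: power iteration on A(alpha) for alpha = 30 and alpha = -30.
for alpha = [30 -30]
    A = [alpha 2 3 13; 5 11 10 8; 9 7 6 12; 4 14 15 1];
    lam = eig(A);
    [~, idx] = sort(abs(lam), 'descend');
    lam = lam(idx);                       % eigenvalues ordered by decreasing modulus
    ratio = abs(lam(2))/abs(lam(1));      % governs the speed of convergence
    y = ones(4,1)/2;                      % unit-length initial vector
    lambda = y'*A*y; err = Inf; iter = 0;
    while err > 1.e-10*abs(lambda) && iter < 5000
        x = A*y; y = x/norm(x);           % one step of the power method
        lnew = y'*A*y;
        err = abs(lnew - lambda); lambda = lnew; iter = iter + 1;
    end
    fprintf('alpha = %d: ratio = %.4f, lambda = %.4f, iterations = %d\n', ...
            alpha, ratio, lambda, iter);
end
```

A ratio close to 1 corresponds to a much larger number of iterations, in agreement with the explanation given in the example.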



6.1.1 Convergence analysis 



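Since we have assumed that the eigenvectors $\mathbf{x}_1, \ldots, \mathbf{x}_n$ of A are linearly 
independent, these eigenvectors form a basis for $\mathbb{C}^n$. Thus the vectors 
$\mathbf{x}^{(0)}$ and $\mathbf{y}^{(0)}$ can be written as 

$\mathbf{x}^{(0)} = \sum_{i=1}^{n} \alpha_i \mathbf{x}_i, \quad \mathbf{y}^{(0)} = \beta^{(0)} \sum_{i=1}^{n} \alpha_i \mathbf{x}_i, \quad \text{with } \beta^{(0)} = 1/\|\mathbf{x}^{(0)}\|.$ 

At the first step the power method gives 

$\mathbf{x}^{(1)} = A\mathbf{y}^{(0)} = \beta^{(0)} A \sum_{i=1}^{n} \alpha_i \mathbf{x}_i = \beta^{(0)} \sum_{i=1}^{n} \alpha_i \lambda_i \mathbf{x}_i$ 

and, similarly, 

$\mathbf{y}^{(1)} = \beta^{(1)} \sum_{i=1}^{n} \alpha_i \lambda_i \mathbf{x}_i, \quad \beta^{(1)} = \frac{1}{\|\mathbf{x}^{(0)}\| \, \|\mathbf{x}^{(1)}\|}.$ 

At the generic k-th step we will have 

$\mathbf{y}^{(k)} = \beta^{(k)} \sum_{i=1}^{n} \alpha_i \lambda_i^k \mathbf{x}_i, \quad \beta^{(k)} = \frac{1}{\|\mathbf{x}^{(0)}\| \cdots \|\mathbf{x}^{(k)}\|}$ 

and therefore 

$\mathbf{y}^{(k)} = \lambda_1^k \beta^{(k)} \left( \alpha_1 \mathbf{x}_1 + \sum_{i=2}^{n} \alpha_i \left(\frac{\lambda_i}{\lambda_1}\right)^k \mathbf{x}_i \right).$ 

Since $|\lambda_i/\lambda_1| < 1$ $\forall i = 2, \ldots, n$, the vector $\mathbf{y}^{(k)}$ tends to align along the 
same direction as the eigenvector $\mathbf{x}_1$ when k tends to $+\infty$, provided $\alpha_1 \ne 
0$. The condition on $\alpha_1$, which is impossible to ensure in practice since 
$\mathbf{x}_1$ is unknown, is in fact not restrictive. Actually, the effect of roundoff 
errors is the appearance of a non-null component along the direction of 
$\mathbf{x}_1$, even though this was not the case for the initial vector $\mathbf{x}^{(0)}$. (We can 
say that this is one of the rare circumstances where roundoff errors help 
us!) 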

Example 6.2 Consider the matrix $A(\alpha)$ of Example 6.1, with $\alpha = 16$. The 
eigenvector $\mathbf{x}_1$ of unit length associated with $\lambda_1$ is $(1/2, 1/2, 1/2, 1/2)^T$. Let us 
choose (on purpose!) the initial vector $(2, -2, 3, -3)^T$, which is orthogonal to 
$\mathbf{x}_1$. We report in Figure 6.3 the cosine of the angle contained between $\mathbf{y}^{(k)}$ and 
$\mathbf{x}_1$. We can see that after about 30 iterations of the power method the cosine 
tends to -1 and the angle tends to $\pi$, while the sequence $\lambda^{(k)}$ approaches 
$\lambda_1 = 34$. The power method has therefore generated, thanks to the roundoff 
errors, a sequence of vectors $\mathbf{y}^{(k)}$ whose component along the direction of $\mathbf{x}_1$ 
is increasingly relevant. 






Fig. 6.3. The cosine of the angle contained between $\mathbf{y}^{(k)}$ and $\mathbf{x}_1$ (left), and 
the value of $\lambda^{(k)}$ (right), for k = 1, ..., 44 



Remark 6.1 It is possible to prove that the power method converges even 
if $\lambda_1$ is a multiple root of $p_A(\lambda)$. On the contrary it does not converge when 
there exist two distinct eigenvalues both with maximum modulus. In that case 
the sequence $\lambda^{(k)}$ does not converge to any limit, rather it oscillates between 
two values. 




See the Exercises 6.1-6.3. 



6.2 Generalization of the power method 



A first possible generalization of the power method consists of applying 
it to the inverse of the matrix A (provided A is non singular!). Since the 
eigenvalues of $A^{-1}$ are the reciprocals of those of A, the power method 
in that case allows us to approximate the eigenvalue of A of minimum 
modulus. In this way we obtain the so-called inverse power method: 
given an initial vector $\mathbf{x}^{(0)}$, we set $\mathbf{y}^{(0)} = \mathbf{x}^{(0)}/\|\mathbf{x}^{(0)}\|$ and, for $k \ge 1$, 
we compute 

$\mathbf{x}^{(k)} = A^{-1}\mathbf{y}^{(k-1)}, \quad \mathbf{y}^{(k)} = \frac{\mathbf{x}^{(k)}}{\|\mathbf{x}^{(k)}\|}, \quad \mu^{(k)} = (\mathbf{y}^{(k)})^T A^{-1}\mathbf{y}^{(k)}.$ 

If A admits linearly independent eigenvectors, and if also the eigenvalue 
$\lambda_n$ of minimum modulus is distinct from the others, then 

$\lim_{k\to\infty} \mu^{(k)} = 1/\lambda_n$ 

(i.e. $(\mu^{(k)})^{-1}$ tends to $\lambda_n$ for $k \to \infty$). 

At each step k we have to solve a linear system of the form $A\mathbf{x}^{(k)} = 
\mathbf{y}^{(k-1)}$. It is therefore convenient to generate the LU factorization of A 
(or its Cholesky factorization if A is symmetric and positive definite) 
once for all, and then solve two triangular systems at each iteration. 
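The remark on the factorization can be sketched as follows: the LU factors are computed once, outside the loop, and each iteration only requires two triangular solves. The matrix is A(30) of Example 6.1; the loop is an inline illustration with our own variable names and stopping test.

```matlab
% Sketch: inverse power method reusing a single LU factorization.
A = [30 2 3 13; 5 11 10 8; 9 7 6 12; 4 14 15 1];
[L, U, Pm] = lu(A);                % Pm*A = L*U, computed once for all
y = ones(4,1)/2;                   % unit-length initial vector
w = U\(L\(Pm*y));                  % w = A^{-1}*y via two triangular systems
mu = y'*w; err = Inf; iter = 0;
while err > 1.e-10*abs(mu) && iter < 500
    y = w/norm(w);                 % normalize the new iterate
    w = U\(L\(Pm*y));              % again two triangular solves, no new factorization
    munew = y'*w;                  % approximation of 1/lambda_n
    err = abs(munew - mu); mu = munew; iter = iter + 1;
end
lambda_min = 1/mu;                 % eigenvalue of A of minimum modulus
fprintf('mu = %.4f, lambda_min = %.4f, iterations = %d\n', mu, lambda_min, iter);
```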

Example 6.3 When applied to the matrix A(30) of Example 6.1, the inverse 
power method after 7 iterations yields the value 3.5037. Thus the eigenvalue of 
A(30) of minimum modulus will be approximately equal to $1/3.5037 \simeq 0.2854$. 

A further generalization of the power method stems from the following 
consideration. Let $\lambda_\mu$ denote the (unknown) eigenvalue of A closest to 
a given number (real or complex) $\mu$. In order to approximate $\lambda_\mu$, we 
can at first approximate the eigenvalue of minimum modulus, say $\lambda_{min}(A_\mu)$, 
of the shifted matrix $A_\mu = A - \mu I$, and then set $\lambda_\mu = \lambda_{min}(A_\mu) + 
\mu$. We can therefore apply the inverse power method to $A_\mu$ to obtain 
an approximation of $\lambda_{min}(A_\mu)$. This technique is known as the power 
method with shift, and the number $\mu$ is called the shift. 

In the Program 10 we implement the inverse power method with shift. 
By simply setting $\mu = 0$ we recover the inverse power method. The first 
4 input parameters are the same as in Program 9, while mu is the shift. 
Output parameters are the eigenvalue $\lambda_\mu$ of A, its associated eigenvector 
x and the actual number of iterations that have been carried out. 

Program 10 - invshift: inverse power method with shift 




function [lambda,x,iter]=invshift(A,mu,tol,nmax,x0) 
%INVSHIFT Numerically evaluate one eigenvalue of a matrix. 
%   LAMBDA = INVSHIFT(A) computes with the inverse power method the 
%   eigenvalue of A of minimum modulus. 
%   LAMBDA = INVSHIFT(A,MU) computes the eigenvalue 
%   of A closest to the given number (real or complex) MU. 
%   LAMBDA = INVSHIFT(A,MU,TOL,NMAX,X0) uses an absolute error 
%   tolerance TOL (the default is 1.e-6) and a maximum number of iterations 
%   NMAX (the default is 100), starting from the initial vector X0. 
%   [LAMBDA,V,ITER] = INVSHIFT(A,MU,TOL,NMAX,X0) also returns the eigenvector 
%   V such that A*V=LAMBDA*V and the iteration number at which V was computed. 
[n,m]=size(A); 
if n ~= m, error('Only for square matrices'); end 
if nargin == 1 
   x0 = rand(n,1); nmax = 100; tol = 1.e-06; mu = 0; 
elseif nargin == 2 
   x0 = rand(n,1); nmax = 100; tol = 1.e-06; 
end 
[L,U]=lu(A-mu*eye(n)); 
if norm(x0) == 0 
   x0 = rand(n,1); 
end 
x0=x0/norm(x0); 
z0=L\x0; 
pro=U\z0; 
lambda=x0'*pro; 
err=tol*abs(lambda)+1; iter=0; 
while err > tol*abs(lambda) & abs(lambda) ~= 0 & iter <= nmax 
   x = pro; x = x/norm(x); 
   z=L\x; pro=U\z; 
   lambdanew = x'*pro; 
   err = abs(lambdanew - lambda); 
   lambda = lambdanew; 
   iter = iter + 1; 
end 
lambda = 1/lambda + mu; 



Example 6.4 For the matrix A(30) of Example 6.1 we seek the eigenvalue 
closest to the value 17. For that we use the Program 10 with mu=17, 
tol=1.e-10 and x0=[1;1;1;1]. After 8 iterations the iteration has converged to 
the value 1.2183, which corresponds to the eigenvalue 1/1.2183 + 17 $\simeq$ 17.8208. 
A less accurate knowledge of the shift would involve more 
iterations. For instance, if we set mu=13 the program will return the value 
17.8208 after 19 iterations. 

The value of the shift can be modified during the iterations, by setting 
$\mu = \lambda^{(k)}$. This yields a faster convergence; however the computational 
cost grows substantially since now at each iteration the matrix $A_\mu$ does 
change. 
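A minimal sketch of the variable-shift strategy is given below. It is an illustration under our own assumptions, not a program of this book: the symmetric test matrix, the initial shift and the safeguard based on rcond are hypothetical choices, and the system is solved from scratch at every step precisely because the shifted matrix changes.

```matlab
% Sketch: inverse power method with a shift updated at each iteration.
A = [4 1 0; 1 3 1; 0 1 2];         % illustrative symmetric matrix
n = size(A,1);
y = ones(n,1)/sqrt(n);             % unit-length initial vector
mu = 1.5;                          % initial shift (assumed guess)
for k = 1:20
    M = A - mu*eye(n);             % shifted matrix, different at every step
    if rcond(M) < 1.e-12, break; end   % the shift has (numerically) hit an eigenvalue
    x = M\y;                       % a new factorization is needed each time
    y = x/norm(x);
    mu = y'*A*y;                   % update the shift with the current estimate
end
fprintf('eigenvalue estimate = %.10f after %d steps\n', mu, k);
```

Very few steps are needed, but each one costs a full factorization, which is the trade-off mentioned in the text.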

See the Exercises 6.4-6.6. 



6.3 How to compute the shift 



In order to apply successfully the power method with shift we need to 
locate (more or less accurately) the eigenvalues of A in the complex 
plane. To this end let us introduce the following definition. 

Let A be a square matrix of dimension n. The Gershgorin circles $C_i^{(r)}$ 
and $C_i^{(c)}$ associated with its i-th row and i-th column are respectively 
defined as 

$C_i^{(r)} = \{ z \in \mathbb{C} : |z - a_{ii}| \le \sum_{j=1, j\ne i}^{n} |a_{ij}| \},$ 
$C_i^{(c)} = \{ z \in \mathbb{C} : |z - a_{ii}| \le \sum_{j=1, j\ne i}^{n} |a_{ji}| \}.$ 

$C_i^{(r)}$ is called the i-th row circle and $C_i^{(c)}$ the i-th column circle. 

By the Program 11 we can visualize in two different windows the row 
circles and the column circles of a matrix. The command hold on allows 
the overlapping of subsequent pictures (in our case, the different circles 
that have been computed in sequential mode). This command can be 
neutralized by the command hold off. 

The command patch was used in order to color the circles, while the 
command axis sets scaling for the x- and y-axes on the current plot. 



Program 11 - gershcircles: Gershgorin circles

function gershcircles(A)
%GERSHCIRCLES plots the Gershgorin circles
%   GERSHCIRCLES(A) plots the Gershgorin circles for
%   the square matrix A and its transpose.
Abs = abs(A); n = max(size(A)); radii = sum(Abs,2)-diag(Abs);
xcenter = real(diag(A)); ycenter = imag(diag(A)); theta = [0:pi/100:2*pi];
costheta = cos(theta); sintheta = sin(theta);
x = []; y = []; figure(1); clf; axis equal; hold on;
for i = 1:n
   x=[x; radii(i)*cos(theta)+xcenter(i)]; y=[y; radii(i)*sin(theta)+ycenter(i)];
   patch(x(i,:),y(i,:),'red');
end
for i = 1:n, plot(x(i,:),y(i,:),'k',xcenter(i),ycenter(i),'xk'); end
xmax = max(max(x)); ymax = max(max(y));
xmin = min(min(x)); ymin = min(min(y));
hold off; figure(2); clf; axis equal; hold on; radii = sum(Abs)-(diag(Abs))';
x = []; y = []; clf; axis equal; hold on;
for i = 1:n
   x=[x; radii(i)*cos(theta)+xcenter(i)]; y=[y; radii(i)*sin(theta)+ycenter(i)];
   patch(x(i,:),y(i,:),'green');
end
for i = 1:n, plot(x(i,:),y(i,:),'k',xcenter(i),ycenter(i),'xk'); end
xmax = max(max(max(x)),xmax); ymax = max(max(max(y)),ymax);
xmin = min(min(min(x)),xmin); ymin = min(min(min(y)),ymin); hold off;
axis([xmin xmax ymin ymax]); figure(1); axis([xmin xmax ymin ymax]);



Example 6.5 In Figure 6.4 we have plotted the Gershgorin circles associated
with the matrix

$$A = \begin{bmatrix} 30 & 1 & 2 & 3 \\ 4 & 15 & -4 & -2 \\ -1 & 0 & 3 & 5 \\ -3 & 5 & 0 & -1 \end{bmatrix}.$$

The centers of the circles have been identified by dots.

Fig. 6.4. Row circles (left) and column circles (right) for the matrix of
Example 6.5

As previously anticipated, Gershgorin circles may be used to locate the 
eigenvalues of a matrix, as stated in the following proposition. 

Proposition 6.1 All eigenvalues of a given matrix $A \in \mathbb{C}^{n \times n}$
belong to the region of the complex plane which is the intersection of the
two regions formed respectively by the union of the row circles and the
union of the column circles.

Moreover, should m row circles (or column circles), with $1 \le m \le n$,
be disconnected from the union of the remaining $n - m$ circles, then their
union contains exactly m eigenvalues.
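
The first statement of the proposition can be checked numerically. The
sketch below, written for an illustrative matrix that is an assumption of
this example, tests whether every eigenvalue returned by eig falls in at
least one row circle and in at least one column circle (a small tolerance
absorbs roundoff).

```matlab
A = [5 1 0; 1 2 0.5; 0.2 0 -3];       % illustrative matrix (assumption)
c  = diag(A);                          % centers a_ii
rr = sum(abs(A),2) - abs(c);           % radii of the row circles
rc = (sum(abs(A),1))' - abs(c);        % radii of the column circles
lambda = eig(A);
inrow = false(size(lambda)); incol = false(size(lambda));
for k = 1:length(lambda)
    inrow(k) = any(abs(lambda(k) - c) <= rr + 1e-12);  % in some row circle
    incol(k) = any(abs(lambda(k) - c) <= rc + 1e-12);  % in some column circle
end
disp([inrow incol])                    % a 1 means "inside the union"
```

For this particular matrix the three row circles are pairwise disjoint, so
by the second statement each of them contains exactly one eigenvalue.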

There is no guarantee that a circle should contain eigenvalues, unless 
it is isolated from the others. The previous result can be applied in order 
to obtain a preliminary guess of the shift, as we show in the following 
example. 

Example 6.6 From the analysis of the row circles of the matrix A(30) of
Example 6.1 we deduce that the real parts of the eigenvalues of A lie between
-32 and 48. Thus we can use Program 10 to compute the maximum modulus
eigenvalue by setting the value of the shift $\mu$ equal to 48. The
convergence is achieved in 16 iterations, whereas 24 iterations would be
required using the power method with the same initial guess x0=[1;1;1;1]
and the same tolerance tol=1.e-10.

See Exercises 6.7-6.8.



6.4 Computation of all the eigenvalues 



Two square matrices A and B having the same dimension are called
similar if there exists a nonsingular matrix P such that

$$P^{-1} A P = B.$$

Similar matrices share the same eigenvalues. Indeed, if $\lambda$ is an
eigenvalue of A and $x \ne 0$ is an associated eigenvector, we have

$$B P^{-1} x = P^{-1} A x = \lambda P^{-1} x,$$

that is, $\lambda$ is also an eigenvalue of B and its associated eigenvector
is now $y = P^{-1} x$.
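
A quick numerical check of this property may help. In the sketch below the
matrices A and P are illustrative assumptions (any nonsingular P would do);
the code compares the spectra of A and $B = P^{-1}AP$ and verifies that
$y = P^{-1}x$ is an eigenvector of B.

```matlab
A = [2 1 0; 1 3 1; 0 1 4];        % illustrative symmetric matrix (assumption)
P = [1 2 0; 0 1 1; 1 0 1];        % nonsingular, det(P) = 3
B = P\(A*P);                      % B = P^{-1} A P without forming inv(P)
eA = sort(eig(A));
eB = sort(real(eig(B)));          % real() discards roundoff-sized imaginary parts
disp(max(abs(eA - eB)))           % of the order of roundoff
[X,D] = eig(A);
y = P\X(:,1);                     % candidate eigenvector of B for D(1,1)
disp(norm(B*y - D(1,1)*y))        % again of the order of roundoff
```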

The methods which allow a simultaneous approximation of all the 
eigenvalues of a matrix are generally based on the idea of transforming 
A (after an infinite number of steps) into a similar matrix with diagonal 
or triangular form, whose eigenvalues are therefore given by the entries 
lying on its main diagonal. 

Among these methods we mention the QR method which is implemented
in MATLAB in the function eig. More precisely, the command
D=eig(A) returns a vector D containing all the eigenvalues of A. However,
by setting [X,D]=eig(A), we obtain two matrices: the diagonal matrix
D formed by the eigenvalues of A, and a matrix X whose column vectors
are the eigenvectors of A. Thus, A*X=X*D. When A is stored in sparse
mode the command eigs(A,k) allows the computation of the k
eigenvalues of largest modulus of A.
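
The following lines sketch how these commands are typically used. The
tridiagonal test matrix is an assumption made only for illustration.

```matlab
% Illustrative tridiagonal matrix (an assumption, not taken from the text)
n = 10;
A = diag(1:n) + diag(ones(n-1,1),1) + diag(ones(n-1,1),-1);
d = eig(A);                       % vector with all the eigenvalues
[X,D] = eig(A);                   % columns of X: eigenvectors; D: diagonal matrix
res = norm(A*X - X*D);            % residual of A*X = X*D, of the order of roundoff
dk = eigs(sparse(A),3);           % the 3 eigenvalues of largest modulus
dsorted = sort(abs(d),'descend');
disp(res)
disp([sort(abs(dk),'descend') dsorted(1:3)])   % the two columns should agree
```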

Should all eigenvalues of a matrix A be distinct in modulus, the sequence
of matrices generated by the QR method converges to an upper triangular
matrix similar to A, which reduces to a diagonal matrix when A is
symmetric. A description of the algorithm which stands at the basis of the
QR method is provided in Exercises 6.9 and 6.10 and, more thoroughly,
in [QSS00], Chapter 5.



Let us summarize 



1. The power method is an iterative procedure to compute the eigen- 
value of maximum modulus of a given matrix; 

2. the inverse power method allows the computation of the eigenvalue 
of minimum modulus; it requires the factorization of the given 
matrix; 

3. the power method with shift allows the computation of the eigen-
value closest to a given number; its effective application requires
some a priori knowledge of the location of the eigenvalues of the
matrix, which can be achieved by inspecting the Gershgorin circles;

4. the QR method is a global technique which allows the simultaneous 
approximation of all the eigenvalues of a given matrix. 

See the Exercises 6.9-6.11. 



6.5 What we haven't told you 



We have not analyzed the issue of the condition number of the eigen- 
value problem, which measures the sensitivity of the eigenvalues to the 
variation of the entries of the matrix. The interested reader is referred 
to, for instance, [Wil65], [GL89] and [QSS00], Chapter 5. 

Let us just remark that the eigenvalue computation is not necessarily
an ill-conditioned problem when the condition number of the matrix is
large. An instance of this is provided by the Hilbert matrix (see Example
5.8): although its condition number is extremely large, the eigenvalue
computation of the Hilbert matrix is well conditioned thanks to the fact
that the matrix is symmetric and positive definite.

Besides the QR method, for computing simultaneously all the eigenval- 
ues, we can use the Jacobi method which transforms a symmetric matrix 
into a diagonal matrix, by eliminating, step-by-step, through similarity 
transformations, every off-diagonal element. This method does not ter- 
minate in a finite number of steps since, while a new off-diagonal element 
is set to zero, those previously treated can reassume non-zero values. 

Other methods are the Lanczos method and the method which uses 
the so-called Sturm sequences. For a survey of all these methods see 
[Saa92] . 

The MATLAB library ARPACKC (available through the command
arpackc) can be used to compute the eigenvalues of large matrices, and
can be downloaded from the address

http://www.caam.rice.edu/software/ARPACK/

The command eigs uses this library.

Let us mention that an appropriate use of the deflation technique 
(which consists of a successive elimination of the eigenvalues already 
computed) allows the acceleration of the convergence of the previous 
methods and hence the reduction of their computational cost. 



6.6 Exercises 



Exercise 6.1 Upon setting the tolerance equal to $\epsilon = 10^{-10}$, use
the power method to approximate the maximum modulus eigenvalue for the
following matrices, starting from the initial vector $x^{(0)} = (1, 2, 3)^T$:

$$A_1 = \begin{bmatrix} 1 & 2 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}, \quad
A_2 = \begin{bmatrix} 0.1 & 3.8 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}, \quad
A_3 = \begin{bmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}.$$

Then comment on the convergence behavior of the method in the three
different cases.

Exercise 6.2 The features of a population of fishes are described by the
following Leslie matrix introduced in Problem 6.2:

Age interval (months)   x^(0)   m_i   s_i
0-3                     6       0     0.2
3-6                     12      0.5   0.4
6-9                     8       0.8   0.8
9-12                    4       0.3   -

Find the vector x of the normalized distribution of this population for
different age intervals, according to what we have seen in Problem 6.2.

Exercise 6.3 Prove that the power method does not converge for matrices
featuring an eigenvalue of maximum modulus $\lambda_1 = \gamma e^{i\vartheta}$
and another eigenvalue $\lambda_2 = \gamma e^{-i\vartheta}$, where
$i = \sqrt{-1}$ and $\gamma, \vartheta \in \mathbb{R}$.

Exercise 6.4 Show that the eigenvalues of $A^{-1}$ are the reciprocals of
those of A.

Exercise 6.5 Verify that the power method is unable to compute the
maximum modulus eigenvalue of the following matrix, and explain why:

$$A = \begin{bmatrix} 1/3 & 2/3 & 2 & 3 \\ 1 & 0 & -1 & 2 \\ 0 & 0 & -5/3 & -2/3 \\ 0 & 0 & 1 & 0 \end{bmatrix}.$$

Exercise 6.6 By using the power method with shift, compute the largest
positive eigenvalue and the largest negative eigenvalue of

$$A = \begin{bmatrix}
3 & 1 & 0 & 0 & 0 & 0 & 0 \\
1 & 2 & 1 & 0 & 0 & 0 & 0 \\
0 & 1 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 & 2 & 1 \\
0 & 0 & 0 & 0 & 0 & 1 & 3
\end{bmatrix}.$$

A is the so-called Wilkinson matrix and can be generated by the command
wilkinson(7).

Exercise 6.7 By using the Gershgorin circles, provide an estimate of the
maximum number of the complex eigenvalues of the following matrices:

$$A = \begin{bmatrix} 2 & -1/2 & 0 & -1/2 \\ 0 & 4 & 0 & 2 \\ -1/2 & 0 & 6 & 1/2 \\ 0 & 0 & 1 & 9 \end{bmatrix}, \quad
B = \begin{bmatrix} -5 & 0 & 1/2 & 1/2 \\ 1/2 & 2 & 1/2 & 0 \\ 0 & 1 & 0 & 1/2 \\ 0 & 1/4 & 1/2 & 3 \end{bmatrix}.$$

Exercise 6.8 Use the result of Proposition 6.1 to find a suitable shift for
the computation of the maximum modulus eigenvalue of

$$A = \begin{bmatrix} 5 & 0 & 1 & -1 \\ 0 & 2 & 0 & -1/2 \\ 0 & 1 & -1 & 1 \\ -1 & -1 & 0 & 0 \end{bmatrix}.$$

Then compare the number of iterations as well as the computational cost of
the power method both with and without shift by setting the tolerance equal
to $10^{-14}$.

Exercise 6.9 A matrix $A \in \mathbb{R}^{n \times n}$ admits a QR
factorization if there exist a matrix $Q \in \mathbb{R}^{n \times n}$, with
the orthogonality property $Q^T Q = I$, and an upper triangular matrix
$R \in \mathbb{R}^{n \times n}$, such that $A = QR$. This factorization can
be obtained by the command [Q,R]=qr(A). Use this command to obtain the QR
factorization of the matrix A of Exercise 6.7. Then verify that the matrix
$C = RQ$ is similar to A.

Exercise 6.10 The following algorithm provides the basis for the so-called
QR method for computing all the eigenvalues of a matrix A:
set $A^{(0)} = A$;
then for $k = 0, 1, \ldots$, compute $Q^{(k)}$ and $R^{(k)}$ such that
$A^{(k)} = Q^{(k)} R^{(k)}$;
next define $A^{(k+1)} = R^{(k)} Q^{(k)}$.

Write a MATLAB program that implements this method. Then carry out
a few iterations on the matrix A of Exercise 6.7. What is the limit of the
sequence of matrices $A^{(k)}$?

Exercise 6.11 Use the command eig to compute all the eigenvalues of the 
two matrices given in Exercise 6.7. Then check how accurate are the conclu- 
sions drawn on the basis of Proposition 6.1. 




7. Ordinary differential 
equations 



A differential equation is an equation involving one or more derivatives 
of an unknown function. If all derivatives are taken with respect to a 
single independent variable we call it an ordinary differential equation , 
whereas we have a partial differential equation when partial derivatives 
are present. 

The differential equation (ordinary or partial) has order p if p is the
maximum order of differentiation that is present. The next chapter will
be devoted to the study of a particular kind of second order partial dif- 
ferential equations, the elliptic equations, whereas in the present chapter 
we will deal with ordinary differential equations of first order. 

Ordinary differential equations can describe the evolution of many 
phenomena in various fields, as we can see from the following three ex- 
amples. 

Problem 7.1 (Thermodynamics) Consider a body having internal
temperature $T$ which is set in an environment with constant temperature
$T_e$. Assume that its mass $m$ is concentrated in a single point. Then the
heat transfer between the body and the external environment can be
described by the Stefan-Boltzmann law

$$v(t) = \epsilon \gamma S (T^4(t) - T_e^4),$$

where $t$ is the time variable, $\epsilon$ the Stefan-Boltzmann constant
(equal to $5.6 \cdot 10^{-8}$ J/(m$^2$ K$^4$ s), where J stands for Joule,
K for Kelvin and, obviously, m for meter, s for second), $\gamma$ is the
emissivity constant of the body, $S$ the area of its surface and $v$ is
the rate of the heat transfer. The rate of variation of the energy
$E(t) = mCT(t)$ (where $C$ denotes the specific heat of the material
constituting the body) equals, in absolute value, the rate $v$.
Consequently, setting $T(0) = T_0$, the computation of $T(t)$ requires the
solution of the ordinary differential equation

$$\frac{dT}{dt} = -\frac{v(t)}{mC}. \qquad (7.1)$$

See Exercise 7.16. •

Problem 7.2 (Biology) Consider a population of bacteria in a confined
environment in which no more than $B$ elements can coexist. Assume that, at
the initial time, the number of individuals is equal to $y_0 \ll B$ and the
growth rate of the bacteria is a positive constant $C$. In this case the rate
of change of the population is proportional to the number of existing
bacteria, under the restriction that the total number cannot exceed $B$.
This is expressed by the differential equation

$$\frac{dy}{dt} = C y \left( 1 - \frac{y}{B} \right), \qquad (7.2)$$

whose solution $y = y(t)$ denotes the number of bacteria at time $t$.

Assuming that two populations $y_1$ and $y_2$ are in competition, instead
of (7.2) we would have

$$\begin{aligned}
\frac{dy_1}{dt} &= C_1 y_1 (1 - b_1 y_1 - d_2 y_2), \\
\frac{dy_2}{dt} &= -C_2 y_2 (1 - b_2 y_2 - d_1 y_1),
\end{aligned} \qquad (7.3)$$

where $C_1$ and $C_2$ represent the growth rates of the two populations.
The coefficients $d_1$ and $d_2$ govern the type of interaction between the
two populations, while $b_1$ and $b_2$ are related to the available quantity
of nutrients. The above equations (7.3) are called the Lotka-Volterra
equations and form the basis of various applications. For their numerical
solution, see Example 7.6. •

Problem 7.3 (Electricity) Consider the electrical circuit of Figure 7.1. We
want to compute the function $v(t)$ representing the potential drop at the
ends of the capacitor $C$ starting from the initial time $t = 0$ at which
the switch $I$ has been turned off. Assume that the inductance $L$ can be
expressed as an explicit function of the current intensity $i_1$, that is
$L = L(i_1)$. Ohm's law yields

$$e - \frac{d(i_1 L(i_1))}{dt} = i_1 R_1 + v,$$

where $R_1$ is a resistance. By assuming the current fluxes to be directed
as indicated in Figure 7.1, upon differentiating with respect to $t$ both
sides of Kirchhoff's law $i_1 = i_2 + i_3$ and noticing that
$i_3 = C\,dv/dt$ and $i_2 = v/R_2$, we find the further equation

$$\frac{di_1}{dt} = C \frac{d^2 v}{dt^2} + \frac{1}{R_2} \frac{dv}{dt}.$$

We have therefore found a system of two differential equations whose
solution allows the description of the time variation of the two unknowns
$i_1$ and $v$. The second equation has order 2. For its solution see
Example 7.7. •

Fig. 7.1. The electrical circuit of Problem 7.3



7.1 The Cauchy problem 



We confine ourselves to first order differential equations, as an equation 
of order p > 1 can always be reduced to a system of p equations of order 
1. The case of first order systems will be addressed in Section 7.8. 

An ordinary differential equation in general admits an infinity of
solutions. In order to fix one of them we must impose a further condition
which prescribes the value taken by this solution at a given point of the
integration interval. For instance, the equation (7.2) admits the family
of solutions $y(t) = B\psi(t)/(1 + \psi(t))$ with $\psi(t) = e^{Ct+K}$,
$K$ being an arbitrary constant. If we impose the condition $y(0) = 1$, we
pick up the unique solution corresponding to the value $K = \ln[1/(B - 1)]$.
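
This claim is easy to check numerically. The sketch below assumes that
(7.2) is the logistic equation $y' = Cy(1 - y/B)$ and uses illustrative
values of $B$ and $C$ (both are assumptions); it evaluates the candidate
solution at $t = 0$ and measures the residual of the differential equation
with a centered finite difference.

```matlab
B = 100; C = 0.5;                       % illustrative values (assumptions)
K = log(1/(B-1));                       % constant fixed by y(0) = 1
psi = @(t) exp(C*t + K);
y   = @(t) B*psi(t)./(1 + psi(t));      % candidate solution of (7.2)
t  = linspace(0,10,5);
dt = 1e-6;
dydt = (y(t+dt) - y(t-dt))/(2*dt);      % centered approximation of y'(t)
res  = dydt - C*y(t).*(1 - y(t)/B);     % residual of y' = C*y*(1 - y/B)
disp(y(0))                              % initial value
disp(max(abs(res)))                     % small, limited by the finite difference
```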

We will therefore consider the solution of the so-called Cauchy problem
which takes the following form:
find $y : I \to \mathbb{R}$ such that

$$\begin{cases} y'(t) = f(t, y(t)) & \forall t \in I, \\ y(t_0) = y_0, \end{cases} \qquad (7.4)$$

where $I$ is an interval of $\mathbb{R}$, $f : I \times \mathbb{R} \to \mathbb{R}$
is a given function and $y'$ denotes the derivative of $y$ with respect to
$t$. Finally, $t_0$ is a point of $I$ and $y_0$ a given value which is
called the initial data.

In the following proposition we report a classical result of Analysis. 



Proposition 7.1 Assume that the function $f(t,y)$ is

1. continuous with respect to both arguments;

2. Lipschitz-continuous with respect to its second argument, that is,
there exists a positive constant $L$ such that

$$|f(t, y_1) - f(t, y_2)| \le L |y_1 - y_2|, \quad \forall t \in I, \ \forall y_1, y_2 \in \mathbb{R}.$$

Then the solution of the Cauchy problem (7.4) exists and is unique.

Unfortunately, explicit solutions are available only for very special
types of ordinary differential equations. In some other cases, the solution
is available only in implicit form. This is, for instance, the case with the
equation $y' = (y - t)/(y + t)$ whose solution satisfies the implicit
relation

$$\frac{1}{2} \ln(t^2 + y^2) + \arctan\frac{y}{t} = C,$$

where $C$ is an arbitrary constant. In some other circumstances the
solution is not even representable in implicit form, as in the case of the
equation $y' = e^{-t^2}$ whose general solution can only be expressed
through a series expansion.

For all these reasons, we seek numerical methods capable of approxi- 
mating the solution of every family of ordinary differential equations for 
which solutions do exist. 

The common strategy of all these methods consists of subdividing the
integration interval $I = [t_0, T]$, with $T < +\infty$, into $N_h$
intervals of length $h = (T - t_0)/N_h$; $h$ is called the discretization
step. Then, at each node $t_n$ ($0 \le n \le N_h - 1$) we seek the unknown
value $u_n$ which approximates $y_n = y(t_n)$. The set of values
$\{u_0 = y_0, u_1, \ldots, u_{N_h}\}$ is our numerical solution.



7.2 Euler methods 



A classical method, the forward Euler method, generates the numerical
solution as follows

$$u_{n+1} = u_n + h f_n, \quad n = 0, \ldots, N_h - 1, \qquad (7.5)$$

where we have used the shorthand notation $f_n = f(t_n, u_n)$. This method
is obtained by considering the differential equation (7.4) at every node
$t_n$, $n = 1, \ldots, N_h$ and replacing the exact derivative $y'(t_n)$ by
means of the incremental ratio (4.3).

In a similar way, using this time the incremental ratio (4.7) to
approximate $y'(t_{n+1})$, we obtain the backward Euler method

$$u_{n+1} = u_n + h f_{n+1}, \quad n = 0, \ldots, N_h - 1. \qquad (7.6)$$



Both methods provide an instance of a one-step method since for
computing the numerical solution $u_{n+1}$ at the node $t_{n+1}$ we only
need the information related to the previous node $t_n$.

More precisely, in the forward Euler method $u_{n+1}$ depends exclusively
on the value $u_n$ previously computed, whereas in the backward Euler
method it depends also on itself through the value $f_{n+1}$. For this
reason the first method is called the explicit Euler method and the second
the implicit Euler method.

For instance, the discretization of (7.2) by the forward Euler method
requires at every step the simple computation of

$$u_{n+1} = u_n + h C u_n (1 - u_n/B),$$

whereas using the backward Euler method we must solve the non-linear
equation

$$u_{n+1} = u_n + h C u_{n+1} (1 - u_{n+1}/B).$$



Thus, implicit methods are more costly than explicit methods, since
at every time-level $t_{n+1}$ we must solve a non-linear problem to compute
$u_{n+1}$. However, we will see that implicit methods enjoy better stability
properties than explicit ones.
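
The difference in cost can be seen on a single step. The sketch below
applies one forward and one backward Euler step to the logistic equation
$y' = Cy(1 - y/B)$; the numerical values of B, C, h and of the current
value un are illustrative assumptions, and fzero is used only as one
possible way of solving the scalar non-linear equation.

```matlab
B = 100; C = 0.5; h = 0.1; un = 10;     % illustrative values (assumptions)
f = @(y) C*y.*(1 - y/B);                % right-hand side of the logistic equation
% forward Euler: one explicit evaluation of f
ufe = un + h*f(un);
% backward Euler: find the zero of g(x) = x - un - h*f(x), with x = u_{n+1}
g = @(x) x - un - h*f(x);
ube = fzero(g, un);                     % last computed value as initial guess
fprintf('forward Euler:  %.6f\nbackward Euler: %.6f\n', ufe, ube);
```

The explicit step costs one evaluation of f, whereas the implicit step
requires an iterative solver that evaluates f several times.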

The forward Euler method is implemented in Program 12; the
integration interval is tspan = [t0,tfinal], odefun is a string which
contains the function f(t,y(t)) which depends on the variables t and y,
or an inline function whose first two arguments stand for t and y.



Program 12 - feuler: forward Euler method

function [t,y]=feuler(odefun,tspan,y,Nh,varargin)
%FEULER Solve differential equations using the forward Euler method.
%   [T,Y] = FEULER(ODEFUN,TSPAN,Y0,NH) with TSPAN = [T0 TFINAL]
%   integrates the system of differential equations y' = f(t,y) from
%   time T0 to TFINAL with initial conditions Y0 using the forward Euler
%   method on an equispaced grid of NH intervals.
%   Function ODEFUN(T,Y) must return a column vector
%   corresponding to f(t,y). Each row in the solution array Y corresponds to
%   a time returned in the column vector T.
%   [T,Y] = FEULER(ODEFUN,TSPAN,Y0,NH,P1,P2,...) passes the additional
%   parameters P1,P2,... to the function ODEFUN as ODEFUN(T,Y,P1,P2...).
h=(tspan(2)-tspan(1))/Nh;
tt=linspace(tspan(1),tspan(2),Nh+1);
for t = tt(1:end-1)
   y = [y; y(end,:) + h*feval(odefun,t,y(end,:),varargin{:})];
end
t=tt;

The backward Euler method is implemented in Program 13. Note
that we have used the function fzero for the solution of the non-linear
problem at each step. As initial data for fzero we use the last computed
value of the numerical solution. Note the use of the command
num2str(num,16) which serves the purpose of transforming a number
num into a string by keeping 16 significant digits.



Program 13 - beuler: backward Euler method

function [t,u]=beuler(odefun,tspan,y,Nh,varargin)
%BEULER Solve differential equations using the backward Euler method.
%   [T,Y] = BEULER(ODEFUN,TSPAN,Y0,NH) with TSPAN = [T0 TFINAL]
%   integrates the system of differential equations y' = f(t,y) from
%   time T0 to TFINAL with initial conditions Y0 using the backward Euler
%   method on an equispaced grid of NH intervals.
%   Function ODEFUN(T,Y) must return a column vector
%   corresponding to f(t,y). Each row in the solution array Y corresponds to
%   a time returned in the column vector T.
%   [T,Y] = BEULER(ODEFUN,TSPAN,Y0,NH,P1,P2,...) passes the additional
%   parameters P1,P2,... to the function ODEFUN as ODEFUN(T,Y,P1,P2...).
h=(tspan(2)-tspan(1))/Nh;
tt=linspace(tspan(1),tspan(2),Nh+1);
u(1)=y;
syms x;
y = x;
for t=tt(2:end)
   fun = inline(['x-',num2str(h,16),'*(',char(feval(odefun,t,y,varargin{:})),...
      ')-',num2str(u(end),16)],'x');
   u = [u, fzero(fun,u(end))];
end
t=tt;



7.2.1 Convergence analysis 



A numerical method is convergent if

$$\forall n = 0, \ldots, N_h, \quad |u_n - y_n| \le C(h),$$

where $C(h)$ is infinitesimal with respect to $h$ when $h$ tends to zero. If
$C(h) = O(h^p)$ for some $p > 0$, then we say that the method converges
with order $p$. In order to verify that the forward Euler method converges,
we write the error as follows:

$$e_n = u_n - y_n = (u_n - u_n^*) + (u_n^* - y_n), \qquad (7.7)$$

where $y_n = y(t_n)$ and

$$u_n^* = y_{n-1} + h f(t_{n-1}, y_{n-1}),$$

that is, $u_n^*$ denotes the numerical solution at time $t_n$ which we would
obtain starting from the exact solution at time $t_{n-1}$; see Figure 7.2.
The term $u_n^* - y_n$ in (7.7) represents the error produced by a single
step of the forward Euler method, whereas the term $u_n - u_n^*$ represents
the propagation from $t_{n-1}$ to $t_n$ of the error accumulated at the
previous time-level $t_{n-1}$. The method converges provided both terms tend
to zero as $h \to 0$. Assuming that the second order derivative of $y$
exists and is continuous, thanks to (4.5) we find

$$u_n^* - y_n = -\frac{h^2}{2} y''(\xi_n), \quad \text{for a suitable } \xi_n \in (t_{n-1}, t_n). \qquad (7.8)$$




Fig. 7.2. Geometrical representation of a step of the forward Euler method

Consequently, $u_n^* - y_n$ tends to zero as $h \to 0$.

The quantity $\tau_n(h) = (u_n^* - y_n)/h$ represents the error that would
be generated by forcing the exact solution to satisfy the numerical scheme,
and for this reason is called the local truncation error. The (global)
truncation error is defined as

$$\tau(h) = \max_{n=0,\ldots,N_h} |\tau_n(h)|.$$

We note that these definitions of local and global truncation errors hold
for a general numerical method, not only for the Euler method.

In view of (7.8), the truncation error for the forward Euler method
satisfies

$$\tau(h) \le M h/2, \qquad (7.9)$$

where $M = \max_{t \in [t_0, T]} |f'(t, y(t))|$, the prime denoting here the
total derivative with respect to $t$ (that is, $M = \max |y''(t)|$).

From (7.8) we deduce that $\lim_{h \to 0} \tau(h) = 0$, and a method for
which this happens is said to be consistent. Further, we say that it is
consistent with order $p$ if $\tau(h) = O(h^p)$ for a suitable integer
$p \ge 1$.

Consider now the other term in (7.7). We have

$$u_n^* - u_n = -e_{n-1} + h [ f(t_{n-1}, y_{n-1}) - f(t_{n-1}, u_{n-1}) ]. \qquad (7.10)$$

Since $f$ is Lipschitz continuous with respect to its second argument, we
obtain

$$|u_n^* - u_n| \le (1 + hL) |e_{n-1}|.$$

If $e_0 = 0$, the previous relations yield

$$\begin{aligned}
|e_n| &\le |u_n - u_n^*| + |u_n^* - y_n| \\
&\le (1 + hL) |e_{n-1}| + h |\tau_n(h)| \\
&\le \left[ 1 + (1 + hL) + \ldots + (1 + hL)^{n-1} \right] h \tau(h) \\
&= \frac{(1 + hL)^n - 1}{L} \tau(h) \le \frac{e^{L(t_n - t_0)} - 1}{L} \tau(h).
\end{aligned}$$

We have used the identity

$$\sum_{k=0}^{n-1} (1 + hL)^k = [(1 + hL)^n - 1]/hL,$$

the inequality $1 + hL \le e^{hL}$ and we have observed that
$nh = t_n - t_0$. Therefore we find

$$|e_n| \le \frac{e^{L(t_n - t_0)} - 1}{L} \frac{M}{2} h, \quad \forall n = 0, \ldots, N_h, \qquad (7.11)$$

and thus we can conclude that the forward Euler method converges with 
order 1. We can note that the order of this method coincides with the 
order of its local truncation error. This property is shared by many 
numerical methods for the numerical solution of ordinary differential 
equations. 
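
The estimate can be observed on a simple model problem. The sketch below
takes $y' = -y$, $y(0) = 1$ on $[0,1]$ (an assumption made for
illustration), for which the Lipschitz constant is $L = 1$ and
$\max |y''| = 1$, and compares the forward Euler error at every node with
the bound $(e^{L(t_n - t_0)} - 1) M h/(2L)$.

```matlab
f = @(t,y) -y;  y0 = 1;  t0 = 0;  T = 1;    % model problem, exact solution exp(-t)
L = 1;  M = 1;                              % Lipschitz constant and max |y''| on [0,1]
Nhs = [10 20 40 80];
maxerr = zeros(size(Nhs));  holds = false(size(Nhs));
for k = 1:length(Nhs)
    Nh = Nhs(k);  h = (T - t0)/Nh;  t = t0 + h*(0:Nh);
    u = zeros(1,Nh+1);  u(1) = y0;
    for n = 1:Nh
        u(n+1) = u(n) + h*f(t(n),u(n));     % forward Euler step
    end
    err   = abs(u - exp(-t));               % error at every node
    bound = (exp(L*(t - t0)) - 1)/L * (M/2) * h;
    maxerr(k) = max(err);
    holds(k)  = all(err <= bound + 1e-15);  % tolerance absorbs roundoff at t0
    fprintf('h = %7.5f  max error = %9.3e  bound holds: %d\n', h, maxerr(k), holds(k));
end
```

Halving h should roughly halve the maximum error, in agreement with
convergence of order 1.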



Remark 7.1 The convergence estimate (7.11) is obtained by simply requiring
$f$ to be Lipschitz continuous. A better estimate, precisely
$|e_n| \le M h (t_n - t_0)/2$, holds if $f$ satisfies the further requirement
that $\partial f(t,y)/\partial y \le 0$ for all $t \in [t_0, T]$ and all
$-\infty < y < \infty$. Indeed, in that case, using Taylor expansion, from
(7.10) we obtain
$|u_n^* - u_n| = |1 + h\,\partial f/\partial y(t_{n-1}, \eta_n)|\,|e_{n-1}|$
for a suitable $\eta_n$, thus $|u_n^* - u_n| \le |e_{n-1}|$, provided the
stability inequality $h < 2/\max_t |\partial f/\partial y(t, y(t))|$ holds.
Then $|e_n| \le |u_n^* - y_n| + |e_{n-1}| \le |e_0| + n h \tau(h)$, whence we
conclude owing to (7.9).

Remark 7.2 (Consistency) The property of consistency is necessary in
order to get convergence. Actually, should it be violated, at each step the
numerical method would generate an error which is not infinitesimal with
respect to $h$. The accumulation with the previous errors would prevent the
global error from converging to zero when $h \to 0$.

Using a similar argument we can also prove that the backward Euler 
method converges with order 1 with respect to h. 



Example 7.1 Consider the Cauchy problem

$$\begin{cases} y'(t) = \cos(2y(t)), & t \in (0, 1], \\ y(0) = 0, \end{cases} \qquad (7.12)$$

whose solution is $y(t) = \frac{1}{2} \arcsin((e^{4t} - 1)/(e^{4t} + 1))$. We
solve it by the forward Euler method (Program 12) and the backward Euler
method (Program 13). By the following commands we use different values of
h: 1/2, 1/4, 1/8, ..., 1/1024:

>> tspan=[0,1]; y0=0; f=inline('cos(2*y)','t','y');
>> u=inline('0.5*asin((exp(4*t)-1)./(exp(4*t)+1))','t');
>> Nh=2;
for k=1:10
   [t,ufe]=feuler(f,tspan,y0,Nh); fe(k)=abs(ufe(end)-feval(u,t(end)));
   [t,ube]=beuler(f,tspan,y0,Nh); be(k)=abs(ube(end)-feval(u,t(end)));
   Nh = 2*Nh;
end

The errors committed at the point t = 1 are stored in the variables fe
(forward Euler) and be (backward Euler), respectively. Then we apply
formula (1.11) to estimate the order of convergence. Using the following
commands

>> p=log(abs(fe(1:end-1)./fe(2:end)))/log(2); p(1:2:end)

    1.2898    1.0349    1.0080    1.0019    1.0005

>> p=log(abs(be(1:end-1)./be(2:end)))/log(2); p(1:2:end)

    0.9070    0.9720    0.9925    0.9981    0.9995

we can verify that both methods are convergent with order 1.



Remark 7.3 The error estimate (7.11) was derived by assuming that the
numerical solution $\{u_n\}$ is obtained in exact arithmetic. Should we
account for the (inevitable) roundoff errors, the error might blow up as
$O(1/h)$ as $h$ approaches 0 (see, e.g., [Atk89]). This circumstance
suggests that it might be unreasonable to go below a certain threshold
$h^*$ (which is actually extremely tiny) in practical computations.




See Exercises 7.1-7.3.



7.3 The Crank-Nicolson method



Adding together the generic steps of the forward and backward Euler
methods we find the so-called Crank-Nicolson method

$$u_{n+1} = u_n + \frac{h}{2} [ f_n + f_{n+1} ], \quad n = 0, \ldots, N_h - 1. \qquad (7.13)$$

It is a one-step implicit method, which is implemented in Program
14. Input and output parameters are the same as in the Euler methods.



Program 14 - cranknic: Crank-Nicolson method

function [t,u]=cranknic(odefun,tspan,y,Nh,varargin)
%CRANKNIC Solve differential equations using the Crank-Nicolson method.
%   [T,Y] = CRANKNIC(ODEFUN,TSPAN,Y0,NH) with TSPAN = [T0 TFINAL]
%   integrates the system of differential equations y' = f(t,y) from
%   time T0 to TFINAL with initial conditions Y0 using the Crank-Nicolson
%   method on an equispaced grid of NH intervals.
%   Function ODEFUN(T,Y) must return a column vector
%   corresponding to f(t,y). Each row in the solution array Y corresponds to
%   a time returned in the column vector T.
%   [T,Y] = CRANKNIC(ODEFUN,TSPAN,Y0,NH,P1,P2,...) passes the additional
%   parameters P1,P2,... to the function ODEFUN as ODEFUN(T,Y,P1,P2...).
h=(tspan(2)-tspan(1))/Nh;
tt=linspace(tspan(1),tspan(2),Nh+1);
u(1)=y;
fold = feval(odefun,tspan(1),y,varargin{:});
syms x;
y = x;
for t=tt(2:end)
   y = x;
   fun = inline(['x-',num2str(h,16),'*0.5*(',char(feval(odefun,t,y,varargin{:})),'+(',...
      num2str(fold,16),'))-(',num2str(u(end),16),')'],'x');
   u = [u, fzero(fun,u(end))];
   fold = feval(odefun,t,u(end),varargin{:});
end
t=tt;

The local truncation error of the Crank-Nicolson method satisfies

$$\begin{aligned}
h \tau_n(h) &= y(t_n) - y(t_{n-1}) - \frac{h}{2} [ f(t_n, y(t_n)) + f(t_{n-1}, y(t_{n-1})) ] \\
&= \int_{t_{n-1}}^{t_n} f(t, y(t))\, dt - \frac{h}{2} [ f(t_n, y(t_n)) + f(t_{n-1}, y(t_{n-1})) ].
\end{aligned}$$

The last equality follows from the fundamental theorem of integration
(which we have recalled in Section 1.4.3). The right-hand side expresses
the error associated with the trapezoidal rule for numerical integration
(4.17). If we assume that $y \in C^3$ and use (4.18), we deduce that

$$\tau_n(h) = -\frac{h^2}{12} y'''(\xi_n) \quad \text{for a suitable } \xi_n \in (t_{n-1}, t_n). \qquad (7.14)$$

Thus the Crank-Nicolson method is consistent with order 2, i.e. its lo- 
cal truncation error tends to 0 as h 2 . Using a similar approach to that 
followed for the forward Euler method, we can show that the Crank- 
Nicolson method is convergent with order 2 with respect to h. 
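
Second-order convergence can be observed without any non-linear solver on
the linear model problem $y' = \lambda y$, $y(0) = 1$ (an assumption made
for illustration). For this problem the Crank-Nicolson step
$u_{n+1} = u_n + \frac{h}{2}(f_n + f_{n+1})$ can be solved by hand, giving
$u_{n+1} = u_n (1 + h\lambda/2)/(1 - h\lambda/2)$.

```matlab
lambda = -1; T = 1;                         % model problem y' = lambda*y, y(0) = 1
Nh = 2.^(1:8); e = zeros(size(Nh));
for k = 1:length(Nh)
    h = T/Nh(k);
    % closed form of Nh(k) Crank-Nicolson steps for a linear right-hand side
    u = ((1 + h*lambda/2)/(1 - h*lambda/2))^Nh(k);
    e(k) = abs(u - exp(lambda*T));          % error at t = T
end
p = log(e(1:end-1)./e(2:end))/log(2);       % estimated orders of convergence
disp(p)
```

The entries of p are expected to approach 2 as h decreases.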

Example 7.2 Let us solve the Cauchy problem (7.12) by using the
Crank-Nicolson method with the same values of h as used in Example 7.1. As
we can see, the results confirm that the error tends to zero with order 2:

>> y0=0; tspan=[0 1]; N=2; f=inline('cos(2*y)','t','y');
>> y='0.5*asin((exp(4*t)-1)./(exp(4*t)+1))';
>> for k=1:10
   [tt,u]=cranknic(f,tspan,y0,N);
   t=tt(end); e(k)=abs(u(end)-eval(y)); N=2*N; end
>> p=log(abs(e(1:end-1)./e(2:end)))/log(2); p(1:2:end)

    1.7940    1.9944    1.9997    2.0000    2.0000



7.4 Zero-stability 



There is a concept of stability, called zero-stability, which guarantees
that, in a fixed bounded interval, small perturbations of data yield
bounded perturbations of the numerical solution when $h \to 0$.

More precisely, a numerical method for the approximation of problem
(7.4), where $I = [t_0, T]$, is zero-stable if there exists $C > 0$ such
that for all $\delta > 0$ and for any $h$ sufficiently small

$$|z_n - u_n| \le C \delta, \quad 0 \le n \le N_h, \qquad (7.15)$$

where $C$ is a constant which might depend on the length of the
integration interval $I$, $z_n$ is the solution that would be obtained by
applying the numerical method at hand to a perturbed problem, and $\delta$
indicates the maximum size of the perturbation. Obviously, $\delta$ must be
small enough to guarantee that the perturbed problem still has a unique
solution on the interval of integration.

For instance, in the case of the forward Euler method, $z_n$ satisfies

$$\begin{cases} z_{n+1} = z_n + h\,[f(t_n, z_n) + \rho_{n+1}], \\ z_0 = y_0 + \rho_0 \end{cases} \qquad (7.16)$$

for $0 \le n \le N_h - 1$, under the assumption that $|\rho_n| \le \delta$, $0 \le n \le N_h$.
For a consistent one-step method it can be proved that zero-stability
is a consequence of the fact that $f$ is Lipschitz-continuous with respect
to its second argument (see, e.g., [QSS00]), in which case $C$ depends on
$\exp((T - t_0)L)$, where $L$ is the Lipschitz constant. However, this is not
necessarily true for other families of methods. Assume for instance that
the numerical method can be written in the general form

$$u_{n+1} = \sum_{j=0}^{p} a_j u_{n-j} + h \sum_{j=0}^{p} b_j f_{n-j} + h b_{-1} f_{n+1}, \qquad n = p, p+1, \ldots \qquad (7.17)$$

for suitable coefficients $\{a_k\}$ and $\{b_k\}$ and $p \ge 0$. This is a linear multistep
method and $p+1$ denotes the number of steps. We will see some examples
of multistep methods in Section 7.6. The polynomial

$$\pi(r) = r^{p+1} - \sum_{j=0}^{p} a_j r^{p-j}$$

is called the first characteristic polynomial associated with the numerical
method (7.17), and we denote its roots by $r_j$, $j = 0, \ldots, p$. The method
(7.17) is zero-stable iff the following root condition is satisfied:

$$\begin{cases} |r_j| \le 1 & \text{for all } j = 0, \ldots, p; \\ \text{furthermore } \pi'(r_j) \ne 0 & \text{for those } j \text{ such that } |r_j| = 1. \end{cases} \qquad (7.18)$$



For example, for the forward Euler method we have $p = 0$, $a_0 = 1$,
$b_{-1} = 0$, $b_0 = 1$. For the backward Euler method we have $p = 0$, $a_0 = 1$,
$b_{-1} = 1$, $b_0 = 0$ and for the Crank-Nicolson method we have $p = 0$,
$a_0 = 1$, $b_{-1} = 1/2$, $b_0 = 1/2$. In all cases there is only one root of $\pi(r)$,
which is equal to 1, and therefore all these methods are zero-stable.

The following property, known as the Lax-Richtmyer equivalence theorem,
is most crucial in the theory of numerical methods (see, e.g., [IK66]), and
highlights the fundamental role played by the property of zero-stability:

Any consistent method is convergent iff it is zero-stable.

See Exercises 7.4-7.5.



7.5 Stability on unbounded intervals 



In the previous section we considered the solution of the Cauchy problem 
on bounded intervals. In that context, the number of subintervals 
depends on h and can become infinite only if h goes to zero. 

On the other hand, there are several situations in which we wish to
integrate the Cauchy problem on very large (virtually infinite) time
intervals. In this case, even if $h$ is fixed, $N_h$ tends to infinity, and then
results like (7.11) become meaningless as the right-hand side of the
inequality contains an unbounded quantity. We are interested in methods
that are able to approximate the solution for arbitrarily long time
intervals, even with a relatively “large” step-size $h$.

Unfortunately, the forward Euler method, which is very cheap to
implement, does not enjoy this property. To see this, let us consider the
following model problem

$$\begin{cases} y'(t) = \lambda y(t), & t \in (0, \infty), \\ y(0) = 1, \end{cases} \qquad (7.19)$$

where $\lambda$ is a negative real number. The exact solution is $y(t) = e^{\lambda t}$, which
tends to 0 as $t$ tends to infinity. Applying the forward Euler method to
(7.19) we find that

$$u_0 = 1, \qquad u_{n+1} = u_n (1 + \lambda h) = (1 + \lambda h)^{n+1}, \qquad n \ge 0. \qquad (7.20)$$



Thus $\lim_{n \to \infty} u_n = 0$ iff

$$-1 < 1 + h\lambda < 1, \quad \text{i.e.} \quad h < 2/|\lambda|. \qquad (7.21)$$

This condition expresses the requirement that, for fixed $h$, the
numerical solution should reproduce the behavior of the exact solution
when $t_n$ tends to infinity. If $h > 2/|\lambda|$, then $\lim_{n \to \infty} |u_n| = +\infty$; thus
(7.21) is a stability condition. The property that $\lim_{n \to \infty} u_n = 0$ is
called absolute stability.

Example 7.3 Let us apply the forward Euler method to solve problem (7.19)
with $\lambda = -1$. In that case we must have $h < 2$ for absolute stability. In Figure
7.3 we report the solutions obtained on the interval [0, 30] for 3 different values
of $h$: $h = 30/14$ ($> 2$), $h = 30/16$ ($< 2$) and $h = 1/2$. We can see that in the
first two cases the numerical solution oscillates. However, only in the first case
(which violates the stability condition) does the absolute value of the numerical
solution fail to vanish at infinity (and actually it diverges).
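The behavior described in this example can be reproduced with a few
lines. The following is a minimal sketch (the variable names are
illustrative) that applies the forward Euler recursion
$u_{n+1} = (1 + \lambda h) u_n$ for the three step-sizes and prints the
amplification factor together with the size of the last computed value.

```matlab
% Forward Euler on y' = lambda*y, y(0) = 1, with lambda = -1 on [0,30]
lambda = -1; T = 30;
hs = [30/14, 30/16, 1/2];          % step-sizes considered in the example
uend = zeros(size(hs));
for i = 1:numel(hs)
    h = hs(i); N = round(T/h);
    u = zeros(1,N+1); u(1) = 1;
    for n = 1:N
        u(n+1) = (1 + lambda*h)*u(n);   % one forward Euler step
    end
    uend(i) = abs(u(end));
    fprintf('h = %6.4f  |1+h*lambda| = %6.4f  |u_N| = %g\n', ...
            h, abs(1 + lambda*h), uend(i));
end
```

For $h = 30/14$ the factor $|1 + \lambda h|$ exceeds 1, so $|u_n|$ grows; for
the other two step-sizes it is below 1 and $|u_n|$ decays, with a sign
change at every step when $h = 30/16$.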




Fig. 7.3. Solutions of problem (7.19), with $\lambda = -1$, obtained by the forward
Euler method, corresponding to $h = 30/14$ ($> 2$) (dashed line), $h = 30/16$ ($< 2$)
(solid line) and $h = 1/2$ (dashed-dotted line)

Similar conclusions hold when $\lambda$ is a negative function of $t$ in (7.19).
However, in this case $|\lambda|$ must be replaced by $\max_{t \in [0,\infty)} |\lambda(t)|$ in the
stability condition. This condition could however be relaxed to one which
is less strict by using a variable step-size $h_n$ which accounts for the local
behavior of $|\lambda(t)|$ in every interval $(t_n, t_{n+1})$.

In particular, the following adaptive forward Euler method could be
used:

choose $u_0 = y_0$ and $h_0 = 2\alpha/|\lambda(t_0)|$; then
for $n = 0, 1, \ldots$, do

$$\begin{aligned} t_{n+1} &= t_n + h_n, \\ u_{n+1} &= u_n + h_n \lambda(t_n) u_n, \\ h_{n+1} &= 2\alpha/|\lambda(t_{n+1})|, \end{aligned} \qquad (7.22)$$

where $\alpha$ is a constant which must be less than 1 in order to have an
absolutely stable method.

For instance, consider the problem

$$y'(t) = -(10 e^{-t} + 1)\, y(t), \qquad t \in (0, 10),$$

with $y(0) = 1$. Since $|\lambda(t)|$ is decreasing, the most restrictive condition
for absolute stability of the forward Euler method is $h < h_0 = 2/|\lambda(0)| =
2/11$. In Figure 7.4, left, we compare the solution of the forward Euler
method with that of the adaptive method (7.22) for three values of $\alpha$.
Note that, although every $\alpha < 1$ is admissible for stability purposes, an
accurate solution requires choosing $\alpha$ sufficiently small. In Figure
7.4, right, we also plot the behavior of $h_n$ corresponding to the same
values of $\alpha$. This picture clearly shows that the sequence $\{h_n\}$ increases
monotonically with $n$.
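The adaptive strategy can be sketched directly from its three update
formulas: advance the time, take a forward Euler step with the current
step-size, then recompute the step-size from $|\lambda|$ at the new time.
The value alpha = 0.4 below is an illustrative assumption (the values used
for the figure are not stated in the text), and the last step is allowed
to overshoot the final time for simplicity.

```matlab
% Adaptive forward Euler for y' = lambda(t)*y, y(0) = 1,
% with lambda(t) = -(10*exp(-t) + 1) on (0,10)
lambda = @(t) -(10*exp(-t) + 1);
alpha = 0.4;                       % illustrative choice, must be < 1
T = 10;
t = 0; u = 1;
h = 2*alpha/abs(lambda(t));        % h_0 = 2*alpha/|lambda(t_0)|
tt = t; uu = u; hh = h;            % histories of t_n, u_n, h_n
while t < T
    u = u + h*lambda(t)*u;         % u_{n+1} = u_n + h_n*lambda(t_n)*u_n
    t = t + h;                     % t_{n+1} = t_n + h_n
    h = 2*alpha/abs(lambda(t));    % h_{n+1} = 2*alpha/|lambda(t_{n+1})|
    tt(end+1) = t; uu(end+1) = u; hh(end+1) = h;
end
fprintf('%d steps, h_0 = %g, last h = %g\n', numel(tt)-1, hh(1), hh(end));
```

Since $\lambda(t) < 0$, each step multiplies $u_n$ by
$1 + h_n \lambda(t_n) = 1 - 2\alpha$, which has modulus less than 1 exactly
when $0 < \alpha < 1$. The same identity shows why stability alone does not
give accuracy: the computed values depend on $\alpha$ only through this
fixed factor, so $\alpha$ must also be small enough for the steps to
resolve the solution.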





Fig. 7.4. Left: the numerical solution on the time interval (0.2, 0.8) obtained
by the forward Euler method with $h = \alpha h_0$ (dashed line) and by the adaptive
variable stepping forward Euler method (7.22) (solid line) for three different
values of $\alpha$. Right: the behavior of the variable step-size $h$ for the adaptive
method (7.22)



In contrast to the forward Euler method, neither the backward Euler
method nor the Crank-Nicolson method requires limitations on $h$ for
absolute stability. In fact, with the backward Euler method we obtain
$u_{n+1} = u_n + \lambda h u_{n+1}$ and therefore

$$u_{n+1} = \left( \frac{1}{1 - \lambda h} \right)^{n+1}, \qquad n \ge 0,$$

which tends to zero as $n \to \infty$ for all values of $h > 0$. Similarly, with
the Crank-Nicolson method we obtain

$$u_{n+1} = \left[ \left( 1 + \frac{h\lambda}{2} \right) \Big/ \left( 1 - \frac{h\lambda}{2} \right) \right]^{n+1}, \qquad n \ge 0,$$

which still tends to zero as $n \to \infty$ for all possible values of $h > 0$.
We can conclude that the forward Euler method is conditionally
absolutely stable, while both the backward Euler and Crank-Nicolson methods
are unconditionally absolutely stable. Methods which are unconditionally
absolutely stable for all complex numbers $\lambda$ in (7.19) with negative
real part are also called A-stable. Both the backward Euler and
Crank-Nicolson methods are A-stable. We point out that only implicit
methods can be A-stable. This property makes implicit methods attractive
in spite of being computationally more expensive than explicit methods.
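To make the comparison concrete, the sketch below evaluates the factor by
which each method multiplies $u_n$ at every step of the model problem:
$1 + h\lambda$ for forward Euler, $1/(1 - h\lambda)$ for backward Euler and
$(1 + h\lambda/2)/(1 - h\lambda/2)$ for Crank-Nicolson. The sample values
of $h$ and the complex value lc are illustrative assumptions.

```matlab
% Moduli of the one-step amplification factors on y' = lambda*y
lambda = -1;
h = [0.5 1.9 2.1 10 100];                          % sample step-sizes
R_fe = abs(1 + h*lambda);                          % forward Euler
R_be = abs(1./(1 - h*lambda));                     % backward Euler
R_cn = abs((1 + h*lambda/2)./(1 - h*lambda/2));    % Crank-Nicolson
disp([h; R_fe; R_be; R_cn])

% The same factors for a complex lambda with negative real part
lc = -0.1 + 3i;
Rc_be = abs(1./(1 - h*lc));
Rc_cn = abs((1 + h*lc/2)./(1 - h*lc/2));
disp([Rc_be; Rc_cn])
```

Only the forward Euler factor exceeds 1 once $h > 2/|\lambda|$; the two
implicit factors stay below 1 for every $h$ shown, in agreement with
unconditional absolute stability.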



7.5.1 Absolute stability controls perturbations 



Consider now the following generalized model problem

$$\begin{cases} y'(t) = \lambda(t) y(t) + r(t), & t \in (0, +\infty), \\ y(0) = 1, \end{cases} \qquad (7.23)$$

where $\lambda$ and $r$ are two continuous functions and $-\lambda_{max} \le \lambda(t) \le -\lambda_{min}$
with $0 < \lambda_{min} \le \lambda_{max} < +\infty$. In this case the exact solution does not
necessarily tend to zero as $t$ tends to infinity; for instance if both $r$ and
$\lambda$ are constants we have

$$y(t) = \left( 1 + \frac{r}{\lambda} \right) e^{\lambda t} - \frac{r}{\lambda},$$

whose limit when $t$ tends to infinity is $-r/\lambda$. Thus, in general, it does
not make sense to require a numerical method to be absolutely stable
when applied to problem (7.23). However, we are going to show that
a numerical method which is absolutely stable on the model problem
(7.19), if applied to the generalized problem (7.23), guarantees that the
perturbations are kept under control as $t$ tends to infinity (possibly under
a suitable constraint on the time-step $h$).

For the sake of simplicity we will confine our analysis to the forward
Euler method; when applied to (7.23) it reads

$$\begin{cases} u_{n+1} = u_n + h(\lambda_n u_n + r_n), & n \ge 0, \\ u_0 = 1 \end{cases}$$

and its solution is (see Exercise 7.10)

$$u_n = u_0 \prod_{k=0}^{n-1} (1 + h\lambda_k) + h \sum_{k=0}^{n-1} r_k \prod_{j=k+1}^{n-1} (1 + h\lambda_j), \qquad (7.24)$$

where $\lambda_k = \lambda(t_k)$ and $r_k = r(t_k)$. Let us consider the following
“perturbed” method

$$\begin{cases} z_{n+1} = z_n + h(\lambda_n z_n + r_n + \rho_{n+1}), & n \ge 0, \\ z_0 = u_0 + \rho_0, \end{cases} \qquad (7.25)$$

where $\rho_0, \rho_1, \ldots$ are given perturbations which are introduced at every
time step. This is a simple model in which $\rho_0$ and $\rho_{n+1}$, respectively,
account for the fact that neither $u_0$ nor $r_n$ can be determined exactly.
(Should we account for all roundoff errors which are actually introduced
at any step, our perturbed model would be far more involved and
difficult to analyze.) The solution of (7.25) reads like (7.24) provided $u_k$ is
replaced by $z_k$ and $r_k$ by $r_k + \rho_{k+1}$, for all $k = 0, \ldots, n-1$. Then

$$z_n - u_n = \rho_0 \prod_{k=0}^{n-1} (1 + h\lambda_k) + h \sum_{k=0}^{n-1} \rho_{k+1} \prod_{j=k+1}^{n-1} (1 + h\lambda_j). \qquad (7.26)$$

It is worth noticing that this quantity does not depend on the function
$r(t)$.

i. For the sake of exposition, let us consider first the special case where
$\lambda_k$ and $\rho_k$ are two constants equal to $\lambda$ and $\rho$, respectively. Assume that
$h < h_0(\lambda) = 2/|\lambda|$, which is the condition on $h$ that ensures the absolute
stability of the forward Euler method applied to the model problem
(7.19). Then, using the following identity for the geometric sum

$$\sum_{k=0}^{n-1} a^k = \frac{1 - a^n}{1 - a}, \qquad \text{if } |a| < 1, \qquad (7.27)$$

we obtain

$$z_n - u_n = \rho \left\{ (1 + h\lambda)^n \left( 1 + \frac{1}{\lambda} \right) - \frac{1}{\lambda} \right\}. \qquad (7.28)$$

It follows that the perturbation error $|z_n - u_n|$ satisfies (see Exercise
7.11)

$$|z_n - u_n| \le \varphi(\lambda) |\rho|, \qquad (7.29)$$

with $\varphi(\lambda) = 1$ if $\lambda \le -1$, while $\varphi(\lambda) = |1 + 2/\lambda|$ if $-1 < \lambda < 0$, and

$$\lim_{n \to \infty} |z_n - u_n| = \frac{|\rho|}{|\lambda|}.$$

For instance, Figure 7.5 corresponds to the case where $\rho = 0.1$, $\lambda = -2$
(left) and $\lambda = -0.5$ (right). In both cases we have taken $h = h_0(\lambda) - 0.01$.
The conclusion that can be drawn is that the perturbation error is
bounded by $|\rho|$ times a constant which is independent of $n$ and $h$.
Obviously, the perturbation error blows up when $n$ increases if the stability
limit $h < h_0$ is violated.

Fig. 7.5. The perturbation error when $\rho = 0.1$: $\lambda = -2$ (left) and $\lambda = -0.5$
(right). In both cases $h = h_0(\lambda) - 0.01$
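The constant-coefficient case can be checked numerically. The sketch
below runs the unperturbed and perturbed forward Euler recursions side by
side for $\lambda = -2$, $\rho = 0.1$ and $h = 2/|\lambda| - 0.01$, and compares
their difference with the closed form
$\rho\{(1 + h\lambda)^n (1 + 1/\lambda) - 1/\lambda\}$. The forcing term
r(t) = sin(t) and the number of steps are illustrative assumptions; the
difference should not depend on r.

```matlab
% Perturbation error of forward Euler, constant lambda and rho
rho = 0.1; lambda = -2;
h = 2/abs(lambda) - 0.01;          % just below the stability limit
N = 2000;                          % illustrative number of steps
r = @(t) sin(t);                   % illustrative forcing term (assumption)
u = 1; z = u + rho;                % u_0 = 1, z_0 = u_0 + rho_0
d = zeros(1,N+1); d(1) = z - u;
for n = 0:N-1
    tn = n*h;
    u = u + h*(lambda*u + r(tn));          % unperturbed step
    z = z + h*(lambda*z + r(tn) + rho);    % perturbed step
    d(n+2) = z - u;
end
n = 0:N;
dex = rho*((1 + h*lambda).^n*(1 + 1/lambda) - 1/lambda);  % closed form
phi = 1;                           % phi(lambda) = 1 because lambda <= -1
fprintf('max |d - dex| = %g\n', max(abs(d - dex)));
fprintf('max |d| = %g, bound = %g, limit = %g\n', ...
        max(abs(d)), phi*abs(rho), abs(rho)/abs(lambda));
```

The computed difference should match the closed form up to roundoff, stay
below $\varphi(\lambda)|\rho| = 0.1$, and approach $|\rho|/|\lambda| = 0.05$ for
large $n$.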



ii. In the general case where $\lambda$ and $r$ are non-constant, let us require
$h$ to satisfy the restriction $h < h_0(\lambda)$, where this time $h_0(\lambda) = 2/\lambda_{max}$.
Then,

$$|1 + h\lambda_k| \le a(h) = \max\{ |1 - h\lambda_{min}|, |1 - h\lambda_{max}| \}.$$

Since $a(h) < 1$, we can still use the identity (7.27) in (7.26) and obtain (see
Exercise 7.11)

$$|z_n - u_n| \le \rho_{max} \left( [a(h)]^n + h\, \frac{1 - [a(h)]^n}{1 - a(h)} \right), \qquad (7.30)$$

where $\rho_{max} = \max |\rho_k|$, $a(h) = |1 - h\lambda_{min}|$ if $h \le h^*$ while $a(h) =
|1 - h\lambda_{max}|$ if $h^* \le h < h_0(\lambda)$, having set $h^* = 2/(\lambda_{min} + \lambda_{max})$.
When $h \le h^*$ we have $1 - a(h) = h\lambda_{min}$, and it follows that

$$|z_n - u_n| \le \rho_{max} \left( [a(h)]^n + 1/\lambda_{min} \right). \qquad (7.31)$$

In particular,

$$\lim_{n \to \infty} |z_n - u_n| \le \rho_{max}/\lambda_{min}, \qquad (7.32)$$

from which we still conclude that the perturbation error is bounded by
$\rho_{max}$ times a constant which is independent of $n$ and $h$ (although the
oscillations are no longer damped as in the previous case).
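The bound for $h \le h^*$ can be tested on a variable-coefficient example.
The sketch below uses $\lambda(t) = -2 - \sin(t)$ (so that
$\lambda_{min} = 1$, $\lambda_{max} = 3$, $h^* = 0.5$) and the perturbation
$\rho(t) = 0.1 \sin(t)$; the forcing term r(t) = cos(t), the step-size
h = 0.4 and the number of steps are illustrative assumptions.

```matlab
% Perturbation error of forward Euler with variable lambda(t), rho(t)
lam = @(t) -2 - sin(t);            % lambda_min = 1, lambda_max = 3
rho = @(t) 0.1*sin(t);             % rho_max = 0.1
r   = @(t) cos(t);                 % illustrative forcing term (assumption)
lmin = 1; lmax = 3; rmax = 0.1;
hstar = 2/(lmin + lmax);           % h* = 0.5
h = 0.4;                           % illustrative step-size with h <= h*
N = 500;
u = 1; z = u + rho(0);             % z_0 = u_0 + rho_0
d = zeros(1,N+1); d(1) = abs(z - u);
for n = 0:N-1
    tn = n*h;
    u = u + h*(lam(tn)*u + r(tn));
    z = z + h*(lam(tn)*z + r(tn) + rho(tn + h));   % uses rho_{n+1}
    d(n+2) = abs(z - u);
end
a = abs(1 - h*lmin);               % a(h) when h <= h*
bound = rmax*(a.^(0:N) + 1/lmin);  % rho_max*(a^n + 1/lambda_min)
fprintf('max error = %g, asymptotic bound = %g\n', max(d), rmax/lmin);
```

The computed error should remain below the bound at every step, and in
particular below $\rho_{max}/\lambda_{min} = 0.1$ for large $n$.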

In fact, a similar conclusion holds also when $h^* \le h < h_0(\lambda)$, although
this does not follow from our upper bound (7.31), which is too pessimistic
in this case. In Figure 7.6 we report the perturbation errors computed on
the problem (7.23), where $\lambda(t) = -2 - \sin(t)$, $\rho_k = \rho(t_k)$, $\rho(t) = 0.1 \sin(t)$,
with $h \le h^*$ (left) and with $h^* \le h < h_0(\lambda)$ (right).

Fig. 7.6. The perturbation error when $\rho(t) = 0.1 \sin(t)$ and $\lambda(t) = -2 - \sin(t)$:
the step-size is $h = h^* - 0.01 = 0.39$ (left) and $h = h^* + 0.01$ (right)



iii. We consider now the general Cauchy problem (7.4). We claim that
this problem can be related to the generalized model problem (7.23), in
those cases where

$$-\lambda_{max} < \partial f/\partial y(t, y) < -\lambda_{min}, \qquad \forall t \ge 0, \ \forall y \in (-\infty, \infty),$$

for suitable values $\lambda_{min}, \lambda_{max} \in (0, +\infty)$. To this end, for every $t$ in the
generic interval $(t_n, t_{n+1})$, we subtract (7.5) from (7.16) to obtain the
following equation for the perturbation error:

$$z_n - u_n = (z_{n-1} - u_{n-1}) + h \{ f(t_{n-1}, z_{n-1}) - f(t_{n-1}, u_{n-1}) \} + h\rho_n.$$

By applying the mean value theorem we obtain

$$f(t_{n-1}, z_{n-1}) - f(t_{n-1}, u_{n-1}) = \lambda_{n-1} (z_{n-1} - u_{n-1}),$$

where $\lambda_{n-1} = f_y(t_{n-1}, \xi_{n-1})$, having denoted $f_y = \partial f/\partial y$ and $\xi_{n-1}$
being a suitable point in the interval whose endpoints are $u_{n-1}$ and
$z_{n-1}$. Thus

$$z_n - u_n = (1 + h\lambda_{n-1})(z_{n-1} - u_{n-1}) + h\rho_n.$$

By a recursive application of this formula we obtain the identity (7.26),
from which we derive the same conclusions drawn in ii., provided the
stability restriction $0 < h < 2/\lambda_{max}$ holds.

Example 7.4 Let us consider the Cauchy problem

$$y'(t) = \arctan(3y) - 3y + t, \qquad t > 0, \qquad (7.33)$$

with $y(0) = 1$. Since $f_y = 3/(1 + 9y^2) - 3$ is negative, we can choose $\lambda_{max} =
\max |f_y| = 3$ and set $h < 2/3$. Thus, we can expect that the perturbations on
the forward Euler method are kept under control provided that $h < 2/3$. This
is confirmed by the results which are reported in Figure 7.7. Note that in this
example, taking $h = 2/3 + 0.01$ (thus violating the previous stability limit), the
perturbation error blows up as $t$ increases.




Fig. 7.7. The perturbation errors when $\rho(t) = \sin(t)$ with $h = 2/\lambda_{max} - 0.01$
(thick line) and $h = 2/\lambda_{max} + 0.01$ (thin line) for the Cauchy problem (7.33)



Example 7.5 We seek a limit on $h$ that guarantees stability for the forward
Euler method applied to approximate the Cauchy problem

$$y' = 1 - y^2, \qquad t > 0, \qquad (7.34)$$

with $y(0) = \frac{e - 1}{e + 1}$. The exact solution is $y(t) = (e^{2t+1} - 1)/(e^{2t+1} + 1)$ and
$f_y = -2y$. Since $f_y \in (-2, -0.9)$ for all $t > 0$, we can take $h$ less than
$h_0 = 1$. In Figure 7.8, left, we report the solutions obtained on the interval
(0, 35) with $h = 0.95$ (thick line) and $h = 1.05$ (thin line). In both cases
the solution oscillates, but remains bounded. Moreover in the first case, which
satisfies the stability constraint, the oscillations are damped and the numerical
solution tends to the exact one as $t$ increases. In Figure 7.8, right, we report
the perturbation errors corresponding to $\rho(t) = \sin(t)$ with $h = 0.95$ (thick
line) and $h = h^* + 0.1$ (thin line). In both cases the perturbation errors remain
bounded; moreover in the former case the upper bound (7.32) is satisfied.

In those cases where no information on $y$ is available, finding the value
$\lambda_{max} = \max |f_y|$ is not a simple matter. A more heuristic approach
could be pursued in these situations, by adopting a variable stepping
procedure. Precisely, one could take $t_{n+1} = t_n + h_n$, where

$$h_n < \frac{2\alpha}{|f_y(t_n, u_n)|},$$

for suitable values of $\alpha$ strictly less than 1. Note that the denominator
depends on the value $u_n$ which is known. In Figure 7.9 we report the
perturbation errors corresponding to Example 7.5 for two different
values of $\alpha$.

Fig. 7.8. On the left, numerical solutions of problem (7.34) obtained by the
forward Euler method with $h = 20/19$ (thin line) and $h = 20/21$ (thick line).
The values of the exact solution are indicated by circles. On the right,
perturbation errors corresponding to $\rho(t) = \sin(t)$ with $h = 0.95$ (thick line) and
$h = h^*$ (thin line)




Fig. 7.9. The perturbation errors corresponding to $\rho(t) = \sin(t)$ with $\alpha = 0.8$
(thick line) and $\alpha = 0.9$ (thin line) for Example 7.5, using the adaptive
strategy



Remark 7.4 The previous analysis can be carried out also for other kinds of
one-step methods, in particular for the backward Euler and Crank-Nicolson
methods. For these methods, which are A-stable, the same conclusions about
the perturbation error can be drawn without requiring any limitation on the
time-step. In fact, in the previous analysis one should replace each term $1 + h\lambda_k$
by $(1 - h\lambda_k)^{-1}$ in the backward Euler case and by $(1 + h\lambda_k/2)/(1 - h\lambda_k/2)$ in
the Crank-Nicolson case.

Let us summarize



1. An absolutely stable method is one which generates a solution $u_n$
of the model problem (7.19) which tends to zero as $t_n$ tends to
infinity;

2. a method is said to be A-stable if it is absolutely stable for any possible
choice of the time-step $h$ (otherwise a method is called conditionally
stable, and $h$ should be lower than a constant depending on
$\lambda$);

3. when an absolutely stable method is applied to a generalized model
problem (like (7.23)), the perturbation error (that is, the absolute
value of the difference between the perturbed and unperturbed
solution) is uniformly bounded (with respect to $h$). In short we
can say that absolutely stable methods keep the perturbation
controlled;

4. the analysis of absolute stability for the linear model problem can
be exploited to find stability conditions on the time-step when
considering the nonlinear Cauchy problem (7.4) with a function $f$
satisfying $\partial f/\partial y < 0$. In that case the stability restriction requires
the step-size to be chosen as a function of $\partial f/\partial y$. Precisely, the
new integration interval $[t_n, t_{n+1}]$ is chosen in such a way that
$h_n = t_{n+1} - t_n$ satisfies $h_n < 2\alpha/|\partial f(t_n, u_n)/\partial y|$ for a suitable
$\alpha < 1$.

See Exercises 7.6-7.14.



7.6 High order methods 



All methods presented so far are elementary examples of one-step
methods. More sophisticated schemes, which allow the achievement of a higher
order of accuracy, are the Runge-Kutta methods and the multistep
methods. Runge-Kutta methods are still one-step methods; however, they
involve several evaluations of the function $f(t, y)$ on every interval $[t_n, t_{n+1}]$.
One of the most celebrated Runge-Kutta methods reads

$$u_{n+1} = u_n + \frac{h}{6} (K_1 + 2K_2 + 2K_3 + K_4),$$

where

$$\begin{aligned} K_1 &= f_n, \\ K_2 &= f(t_n + \tfrac{h}{2}, u_n + \tfrac{h}{2} K_1), \\ K_3 &= f(t_n + \tfrac{h}{2}, u_n + \tfrac{h}{2} K_2), \\ K_4 &= f(t_{n+1}, u_n + h K_3). \end{aligned}$$

It is an explicit method of order four with respect to $h$; at each time step,
it involves four new evaluations of the function $f$. Other Runge-Kutta
methods, either explicit or implicit, can be constructed with arbitrary
order.
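As a minimal sketch of how the four stages combine, the script below
applies this fourth-order Runge-Kutta method to the model problem
$y' = \lambda y$, $y(0) = 1$ with $\lambda = -1$ on $[0, 1]$ and estimates the
order of convergence by halving $h$. The test problem and the grid sizes
Ns are illustrative choices.

```matlab
% Classical four-stage Runge-Kutta method on y' = lambda*y, y(0) = 1
lambda = -1; f = @(t,y) lambda*y;
T = 1; Ns = [10 20 40 80];         % illustrative numbers of subintervals
e = zeros(size(Ns));
for i = 1:numel(Ns)
    N = Ns(i); h = T/N; u = 1;
    for n = 0:N-1
        tn = n*h;
        K1 = f(tn,       u);
        K2 = f(tn + h/2, u + h/2*K1);
        K3 = f(tn + h/2, u + h/2*K2);
        K4 = f(tn + h,   u + h*K3);
        u  = u + h/6*(K1 + 2*K2 + 2*K3 + K4);
    end
    e(i) = abs(u - exp(lambda*T)); % error at the final time
end
p = log(e(1:end-1)./e(2:end))/log(2);   % estimated convergence orders
disp(p)
```

The estimated orders should be close to 4, consistent with the method
being of order four with respect to $h$.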

They stand at the base of a family of MATLAB programs whose names
contain the root ode followed by numbers and letters. In particular,
ode45 is based on a pair of explicit Runge-Kutta methods (the so-called
Dormand-Prince pair) of order 4 and 5, respectively. ode23 is the
implementation of another pair of explicit Runge-Kutta methods (the Bogacki
and Shampine pair). In these methods the integration step varies in order
to guarantee that the error is below a given tolerance (in these programs
the default scalar relative error tolerance RelTol is equal to $10^{-3}$). The
program ode23tb is an implementation of an implicit Runge-Kutta
formula whose first stage is the trapezoidal rule, while the second stage is
a backward differentiation formula of order two (see later).

Multistep methods (see (7.17)) achieve a high order of accuracy by
involving several values $u_n, u_{n-1}, \ldots$ for the determination of $u_{n+1}$. They
can be derived for instance by applying the fundamental theorem of
integration (which we recalled in Section 1.4.3) to the Cauchy problem
(7.4), obtaining

$$y_{n+1} = y_n + \int_{t_n}^{t_{n+1}} f(t, y(t))\, dt, \qquad (7.35)$$

and then approximating the integral by a quadrature formula which
involves the interpolant of $f$ at a suitable set of nodes. A notable example
of multistep method is the three-step, third order (explicit)
Adams-Bashforth formula (AB3)

$$u_{n+1} = u_n + \frac{h}{12} \left( 23 f_n - 16 f_{n-1} + 5 f_{n-2} \right), \qquad (7.36)$$

which is obtained by replacing $f$ in (7.35) by its interpolating polynomial
of degree 2 at the nodes $t_{n-2}, t_{n-1}, t_n$. Another important example is the
three-step, fourth order (implicit) Adams-Moulton formula (AM4)

$$u_{n+1} = u_n + \frac{h}{24} \left( 9 f_{n+1} + 19 f_n - 5 f_{n-1} + f_{n-2} \right), \qquad (7.37)$$

which is obtained by replacing $f$ in (7.35) by its interpolating polynomial
of degree 3 at the nodes $t_{n-2}, t_{n-1}, t_n, t_{n+1}$.

Another family of multistep methods can be obtained by writing the
differential equation at time $t_{n+1}$ and replacing $y'(t_{n+1})$ by a one-sided
incremental ratio of high order. An instance is provided by the formula

$$u_{n+1} = \frac{18}{11} u_n - \frac{9}{11} u_{n-1} + \frac{2}{11} u_{n-2} + \frac{6h}{11} f_{n+1}, \qquad (7.38)$$

which is known as the three-step, third order (implicit) backward
difference formula (BDF3).

All these methods are consistent and zero-stable. Indeed, the first
characteristic polynomial of both (7.36) and (7.37) is $\pi(r) = r^3 - r^2$ and its
roots are $r_0 = 1$, $r_1 = r_2 = 0$, while the first characteristic polynomial
of (7.38) is $\pi(r) = r^3 - 18/11\, r^2 + 9/11\, r - 2/11$ and its roots are $r_0 = 1$,
$r_1 = 0.3182 + 0.2839i$, $r_2 = 0.3182 - 0.2839i$, where $i$ is the imaginary
unit. In all cases, the root condition (7.18) is satisfied.
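These roots can be verified with the MATLAB functions roots, polyder and
polyval. The sketch below checks the root condition for the two
characteristic polynomials: every root must have modulus at most 1, and
the roots of modulus 1 must be simple (nonzero derivative of the
polynomial). The tolerance tol and the variable names are illustrative
assumptions.

```matlab
% Root condition for the first characteristic polynomials
polys = {[1 -1 0 0], [1 -18/11 9/11 -2/11]};   % Adams (AB3, AM4), BDF3
names = {'AB3/AM4', 'BDF3'};
tol = 1e-8;                        % tolerance for "modulus equal to 1"
ok = false(1,2);
for i = 1:2
    c = polys{i};
    r = roots(c);                              % roots of pi(r)
    onCircle = abs(abs(r) - 1) < tol;          % roots with |r| = 1
    dpi = polyval(polyder(c), r);              % pi'(r) at the roots
    ok(i) = all(abs(r) <= 1 + tol) && all(abs(dpi(onCircle)) > tol);
    fprintf('%s: root condition satisfied = %d\n', names{i}, ok(i));
    disp(r.')
end
r_bdf3 = roots(polys{2});
```

For BDF3 the two complex roots have modulus $\sqrt{2/11} \approx 0.4264$,
which agrees with the values $0.3182 \pm 0.2839i$ quoted above.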

Moreover, when applied to the model problem (7.19), AB3 is
absolutely stable if $h < 0.54/|\lambda|$ while AM4 is absolutely stable if $h < 3/|\lambda|$.
The method BDF3 is unconditionally absolutely stable for all real
negative $\lambda$. However, this is no longer true when $\lambda \in \mathbb{C}$ with negative real
part. In other words, BDF3 fails to be A-stable. More generally, there is
no multistep A-stable method of order strictly greater than two.

Multistep methods are implemented in several MATLAB programs, for
instance in ode15s.



7.7 The predictor-corrector methods 



In Section 7.2 it was pointed out that implicit methods yield at each
step a nonlinear problem for the unknown value $u_{n+1}$. For its solution
we can use one of the methods introduced in Chapter 2, or else apply
the function fzero as we have done with Programs 13 and 14.

Alternatively, we can carry out fixed point iterations at every
time-step. As an example, for the Crank-Nicolson method (7.13), for
$k = 0, 1, \ldots$, we compute until convergence

$$u_{n+1}^{(k+1)} = u_n + \frac{h}{2} \left[ f_n + f(t_{n+1}, u_{n+1}^{(k)}) \right].$$

It can be proved that if the initial guess $u_{n+1}^{(0)}$ is chosen conveniently,
a single iteration suffices in order to obtain a numerical solution $u_{n+1}^{(1)}$
whose accuracy is of the same order as the solution $u_{n+1}$ of the original
implicit method. More precisely, if the original implicit method has order
$p$, then the initial guess $u_{n+1}^{(0)}$ must be generated by an explicit method
of order (at least) $p - 1$.

For instance, if we use the first order (explicit) forward Euler method
to initialize the Crank-Nicolson method, we get the Heun method (also
called the improved Euler method), which is a second order explicit
Runge-Kutta method:

$$\begin{aligned} u_{n+1}^* &= u_n + h f_n, \\ u_{n+1} &= u_n + \frac{h}{2} \left[ f_n + f(t_{n+1}, u_{n+1}^*) \right]. \end{aligned} \qquad (7.39)$$



The explicit step is called a predictor, whereas the implicit one is
called a corrector. Another example combines the (AB3) method (7.36)
as predictor with the (AM4) method (7.37) as corrector. These kinds
of methods are therefore called predictor-corrector methods. They enjoy
the order of accuracy of the corrector method. However, being explicit,
they undergo a stability restriction which is typically the same as that of
the predictor method. Thus they are not adequate to integrate a Cauchy
problem on unbounded intervals.
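The simplest predictor-corrector pair, a forward Euler predictor followed
by a single Crank-Nicolson correction (the Heun method), can be sketched
in a few lines. The script below applies it to the model problem
$y' = \lambda y$, $y(0) = 1$ with $\lambda = -1$ on $[0, 1]$ and estimates the
order of convergence; the test problem and the grid sizes Ns are
illustrative choices.

```matlab
% Heun method = forward Euler predictor + one Crank-Nicolson correction
lambda = -1; f = @(t,y) lambda*y;
T = 1; Ns = [10 20 40 80];         % illustrative numbers of subintervals
e = zeros(size(Ns));
for i = 1:numel(Ns)
    N = Ns(i); h = T/N; u = 1;
    for n = 0:N-1
        tn   = n*h;
        fn   = f(tn,u);
        upre = u + h*fn;                        % predictor (explicit)
        u    = u + 0.5*h*(fn + f(tn+h,upre));   % corrector (one iteration)
    end
    e(i) = abs(u - exp(lambda*T));
end
p = log(e(1:end-1)./e(2:end))/log(2);   % estimated convergence orders
disp(p)
```

The estimated orders should be close to 2: the scheme inherits the order
of the corrector although the predictor is only first order accurate.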

In Program 15 we implement a general predictor-corrector method.
The strings pred and corr identify the type of method
that is chosen. For instance, if we use the functions feonestep and
cnonestep, which are defined in Program 16, we can call predcor as
follows:

>> [t,u]=predcor(f,[t0,T],y0,N,'feonestep','cnonestep');

and obtain the Heun method.



Program 15 - predcor: predictor-corrector method

function [t,u]=predcor(odefun,tspan,y,Nh,pred,corr,varargin)
%PREDCOR Solve differential equations using a predictor-corrector method
% [T,Y]=PREDCOR(ODEFUN,TSPAN,Y0,NH,PRED,CORR) with TSPAN =
% [T0 TFINAL] integrates the system of differential equations y' = f(t,y) from time
% T0 to TFINAL with initial conditions Y0 using a general predictor corrector
% method on an equispaced grid of NH intervals. Function ODEFUN(T,Y)
% must return a column vector corresponding to f(t,y). Each row in the
% solution array Y corresponds to a time returned in the column vector T.
% Functions PRED and CORR identify the type of method that is chosen.
% [T,Y] = PREDCOR(ODEFUN,TSPAN,Y0,NH,PRED,CORR,P1,P2,...) passes
% the additional parameters P1,P2,... to the functions ODEFUN, PRED and
% CORR as ODEFUN(T,Y,P1,P2...), PRED(T,Y,P1,P2...), CORR(T,Y,P1,P2...).
h=(tspan(2)-tspan(1))/Nh; tt=[tspan(1):h:tspan(2)];
u=y; [n,m]=size(u); if n < m, u=u'; end
for t=tt(1:end-1)
   y = u(:,end); fn = feval(odefun,t,y,varargin{:});
   upre = feval(pred,t,y,h,fn);                          % predictor step
   ucor = feval(corr,t+h,y,upre,h,odefun,fn,varargin{:}); % corrector step
   u = [u, ucor];
end
t = tt;



Program 16 - onestep: one step of forward Euler (feonestep),
one step of backward Euler (beonestep), one step of
Crank-Nicolson (cnonestep)

function [u]=feonestep(t,y,h,f)
u = y + h*f;
return

function [u]=beonestep(t,u,y,h,f,fn,varargin)
u = u + h*feval(f,t,y,varargin{:});
return

function [u]=cnonestep(t,u,y,h,f,fn,varargin)
u = u + 0.5*h*(feval(f,t,y,varargin{:})+fn);
return

The MATLAB program ode113 implements an Adams-Bashforth-Moulton
scheme with variable step-size.

See Exercises 7.15-7.18.



7.8 Systems of differential equations



Let us consider the following system of first-order ordinary differential
equations whose unknowns are $y_1(t), \ldots, y_m(t)$:

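$$\begin{cases} y_1' = f_1(t, y_1, \ldots, y_m), \\ \quad \vdots \\ y_m' = f_m(t, y_1, \ldots, y_m), \end{cases}$$

where $t \in (t_0, T]$, with the initial conditions

$$y_1(t_0) = y_{0,1}, \quad \ldots, \quad y_m(t_0) = y_{0,m}.$$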

For its solution we could apply to each individual equation one of the
methods previously introduced for a scalar problem. For instance, the
$n$-th step of the forward Euler method would read

$$\begin{cases} u_{n+1,1} = u_{n,1} + h f_1(t_n, u_{n,1}, \ldots, u_{n,m}), \\ \quad \vdots \\ u_{n+1,m} = u_{n,m} + h f_m(t_n, u_{n,1}, \ldots, u_{n,m}). \end{cases}$$

By writing the system in vector form $\mathbf{y}'(t) = \mathbf{F}(t, \mathbf{y}(t))$, with obvious
choice of notation, the extension of the methods previously developed
for the case of a single equation to the vector case is straightforward.
For instance, the method

$$\mathbf{u}_{n+1} = \mathbf{u}_n + h \left( \theta \mathbf{F}(t_{n+1}, \mathbf{u}_{n+1}) + (1 - \theta) \mathbf{F}(t_n, \mathbf{u}_n) \right), \qquad n \ge 0,$$

with $\mathbf{u}_0 = \mathbf{y}_0$, $0 \le \theta \le 1$, is the vector form of the forward Euler method
if $\theta = 0$, the backward Euler method if $\theta = 1$ and the Crank-Nicolson
method if $\theta = 1/2$.

Example 7.6 Let us apply the forward Euler method to solve the
Lotka-Volterra equations (7.3) with $C_1 = C_2 = 1$, $b_1 = b_2 = 0$ and $d_1 = d_2 = 1$. In
order to use Program 12 for a system of ordinary differential equations, let us
create a function f which contains the components of the vector function F,
which we save in the file f.m. For our specific system we have:

function y=f(t,y)
C1=1; C2=1; d1=1; d2=1; b1=0; b2=0;
yy(1)=C1*y(1)*(1-b1*y(1)-d2*y(2)); % first equation
y(2)=-C2*y(2)*(1-b2*y(2)-d1*y(1)); % second equation
y(1)=yy(1);







Now we execute Program 12 with the following instructions

[t,u]=feuler('f',[0,10],[2 2],2000);
plot(t,u); figure(2); plot(u(:,1),u(:,2));

They correspond to solving the Lotka-Volterra system on the time interval
[0, 10] with a time-step $h = 0.005$.

The graph in Figure 7.10, left, represents the time evolution of the two
components of the solution. Note that they are periodic with period $2\pi$. The
second graph in Figure 7.10, right, shows the trajectory issuing from the initial
value in the so-called phase plane, that is, the Cartesian plane whose coordinate
axes are $y_1$ and $y_2$. This trajectory is confined within a bounded region of the
$(y_1, y_2)$ plane. If we start from the point (1.2, 1.2), the trajectory would stay
in an even smaller region surrounding the point (1, 1). This can be explained
as follows. Our differential system admits 2 points of equilibrium at which
$y_1' = 0$ and $y_2' = 0$, and one of them is precisely (1, 1) (the other being (0, 0)).
Actually, they are obtained by solving the nonlinear system

$$\begin{cases} y_1' = y_1 - y_1 y_2 = 0, \\ y_2' = -y_2 + y_2 y_1 = 0. \end{cases}$$

If the initial data coincide with one of these points, the solution remains
constant in time. Moreover, while (0, 0) is an unstable equilibrium point, (1, 1) is
stable, that is, all trajectories issuing from a point near (1, 1) stay bounded in
the phase plane.
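Since Program 12 (feuler) is not reproduced in this section, the
following self-contained sketch performs the same computation with an
explicit loop: forward Euler applied to the system $y_1' = y_1 - y_1 y_2$,
$y_2' = -y_2 + y_2 y_1$ on $[0, 10]$ with 2000 steps and initial value
$(2, 2)$. The variable names are illustrative, and the solution is stored
column-wise, which is a choice made here and not necessarily the layout
returned by feuler.

```matlab
% Forward Euler for the Lotka-Volterra system of this example
F = @(t,y) [ y(1)*(1 - y(2)); -y(2)*(1 - y(1)) ];
T = 10; N = 2000; h = T/N;         % h = 0.005
u = zeros(2,N+1); u(:,1) = [2; 2]; % one column per time level
for n = 1:N
    tn = (n-1)*h;
    u(:,n+1) = u(:,n) + h*F(tn, u(:,n));   % vector forward Euler step
end
t = linspace(0,T,N+1);
fprintf('min = %g, max = %g\n', min(u(:)), max(u(:)));
% plot(t,u); figure(2); plot(u(1,:),u(2,:));   % uncomment to visualize
```

Both components should remain positive and bounded over the whole time
interval, and F vanishes at the equilibrium points (0, 0) and (1, 1).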




Fig. 7.10. Numerical solutions of system (7.3). On the left, we represent $y_1$
and $y_2$ on the time interval (0, 10); the solid line refers to $y_1$, the dashed line to
$y_2$. Two different initial data are considered: (2, 2) (thick lines) and (1.2, 1.2)
(thin lines). On the right, we report the corresponding trajectories in the phase
plane

When we use an explicit method, the step-size $h$ should undergo a
stability restriction similar to the one encountered in Section 7.5. In
the case when the real parts of the eigenvalues $\lambda_k$ of the Jacobian $A(t) =
[\partial \mathbf{F}/\partial \mathbf{y}](t, \mathbf{y})$ of $\mathbf{F}$ are all negative, we can set $\lambda = -\max_t \rho(A(t))$, where
$\rho(A(t))$ is the spectral radius of $A(t)$. This $\lambda$ is a candidate to replace
the one entering in the stability conditions (such as, e.g., (7.21)) that
were derived for the scalar Cauchy problem.

Remark 7.5 The MATLAB programs (ode23, ode45, ...) that we have
mentioned before can be used also for the solution of systems of ordinary differential
equations. The syntax is odeXX('f',[t0 tf],y0), where y0 is the vector of
the initial conditions, f is a function to be specified by the user and odeXX is
one of the methods available in MATLAB.

Now we consider the case of an ordinary differential equation of order
$m$

$$y^{(m)}(t) = f(t, y, y', \ldots, y^{(m-1)}) \qquad (7.40)$$

for $t \in (t_0, T]$, whose solution (when existing) is a family of functions
defined up to $m$ arbitrary constants. The latter can be fixed by prescribing
$m$ initial conditions

$$y(t_0) = y_0, \quad y'(t_0) = y_1, \quad \ldots, \quad y^{(m-1)}(t_0) = y_{m-1}.$$

Setting

$$w_1(t) = y(t), \quad w_2(t) = y'(t), \quad \ldots, \quad w_m(t) = y^{(m-1)}(t),$$

the equation (7.40) can be transformed into a first-order system of $m$
differential equations

$$\begin{cases} w_1' = w_2, \\ w_2' = w_3, \\ \quad \vdots \\ w_{m-1}' = w_m, \\ w_m' = f(t, w_1, \ldots, w_m), \end{cases}$$

with initial conditions

$$w_1(t_0) = y_0, \quad w_2(t_0) = y_1, \quad \ldots, \quad w_m(t_0) = y_{m-1}.$$

Thus we can always approximate the solution of a differential equation
of order $m > 1$ by resorting to the equivalent system of $m$ first-order
equations, and then applying to this system a convenient discretization
method.

Example 7.7 Consider the circuit of Problem 7.3 and suppose that $L(i_1) = L$
is constant and that $R_1 = R_2 = R$. In this case $v$ can be obtained by solving
the following system of two differential equations:

$$\begin{cases} v'(t) = w(t), \\ w'(t) = -\dfrac{1}{LC} \left( \dfrac{L}{R} + RC \right) w(t) - \dfrac{2}{LC}\, v(t) + \dfrac{e}{LC}, \end{cases} \qquad (7.41)$$

with initial conditions $v(0) = 0$, $w(0) = 0$. The system has been obtained from
the second-order differential equation

$$LC \frac{d^2 v}{dt^2} + \left( \frac{L}{R_2} + R_1 C \right) \frac{dv}{dt} + \left( \frac{R_1}{R_2} + 1 \right) v = e.$$

We set $L = 0.1$ Henry, $C = 10^{-3}$ Farad, $R = 10$ Ohm and $e = 5$ Volt, where
Henry, Farad, Ohm and Volt are respectively the units of measure of inductance,
capacitance, resistance and voltage. Now we apply the forward Euler method
with $h = 0.001$ seconds in the time interval [0, 0.1], by the Program 12:

>> [t,u]=feuler('fsys',[0,0.1],[0 0],1000);

where fsys is contained in the file fsys.m:

function y=fsys(t,y)
L=0.1; C=1.e-03; R=10; e=5;
yy(1)=y(2);
y(2)=-(L/R+R*C)/(L*C)*y(2)-2/(L*C)*y(1)+e/(L*C);
y(1)=yy(1);

In Figure 7.11 we report the approximated values of v and w. As expected, v(t) 
tends to e/2 = 2.5 Volt for large t. In this case the real part of the eigenvalues 
of A (t) = [dF/dy](t,y) is negative and A can be set equal to —141.4214. Then 
a condition for absolute stability is to take h < 2/|A| = 0.0282. 

See the Exercises 7.19-7.20. 
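The reduction to a first-order system used in Example 7.7 can be sketched in a self-contained way as follows. This is only an illustrative script, not Program 12: the forward Euler loop is written inline, and the names F, u, A, lambda and amplification are assumptions introduced here. The last lines compute the eigenvalues of the (constant) Jacobian and the forward Euler amplification factors $|1 + h\lambda|$ for the chosen h.

```matlab
% Parameters of Example 7.7
L = 0.1; C = 1e-3; R = 10; e = 5;
% Right-hand side of system (7.41); y(1) = v, y(2) = w
F = @(t,y) [y(2); -(L/R + R*C)/(L*C)*y(2) - 2/(L*C)*y(1) + e/(L*C)];
h = 0.001; T = 0.1;
n = round(T/h);                 % number of time steps
t = (0:n)*h;
u = zeros(2, n+1);              % first column holds v(0) = w(0) = 0
for k = 1:n
    u(:,k+1) = u(:,k) + h*F(t(k), u(:,k));   % forward Euler step
end
% Jacobian of F (constant for this linear system) and its eigenvalues
A = [0 1; -2/(L*C) -(L/R + R*C)/(L*C)];
lambda = eig(A);
% forward Euler amplification factors for the chosen step
amplification = abs(1 + h*lambda);
fprintf('v(T) = %.4f, |lambda| = %.4f\n', u(1,end), abs(lambda(1)));
```

With these data the computed v(T) should be close to e/2, and both amplification factors should be smaller than 1, in agreement with the stable behavior reported for h = 0.001.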



7.9 What we haven’t told you 



For a complete derivation of the whole family of the Runge-Kutta methods
we refer to [But87], [Lam91] and [QSS00], Chapter 11.

For derivation and analysis of multistep methods we refer to [Arn73] 
and [Lam91]. 

Fig. 7.11. Numerical solution of system (7.41). The potential drop v(t) is
reported on the left, its derivative w on the right

We have not mentioned the so-called stiff problems, e.g. those associated
with differential systems for which the Jacobian of F is an ill-conditioned
matrix. Their numerical solution requires ad hoc implicit methods, since
otherwise unreasonably small time-steps would be required (see [QSS00],
Chapter 11, [Lam91] and [DB02]). The MATLAB programs ode15s, ode23s
and ode23tb are suited for stiff problems. Consider for instance the
Van der Pol system

$$\begin{cases} y_1' = y_2, \\ y_2' = 1000(1 - y_1^2)\, y_2 - y_1, \\ y_1(0) = 2, \quad y_2(0) = 0. \end{cases} \qquad (7.42)$$

Using the program ode45 on the interval (0, 3000) would require an
unreasonably large CPU time due to the tiny time-step which is imposed by
stability requirements. On the contrary, ode15s provides an accurate solution
with "only" 592 time-steps (see Figure 7.12).
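A minimal sketch of the ode15s computation for (7.42) is given below. It assumes the default tolerances; the variable names F, nsteps, hmin and hmax are introduced here for illustration, and the number of steps actually taken depends on the MATLAB version and on the tolerances, so it need not coincide with the figure quoted above.

```matlab
% Van der Pol system (7.42) written as y' = F(t,y)
F = @(t,y) [y(2); 1000*(1 - y(1)^2)*y(2) - y(1)];
y0 = [2; 0];
[t, y] = ode15s(F, [0 3000], y0);
nsteps = numel(t) - 1;           % number of accepted time steps
hmin = min(diff(t));             % smallest step used
hmax = max(diff(t));             % largest step used
fprintf('steps: %d, smallest h: %.2e, largest h: %.2e\n', nsteps, hmin, hmax);
```

The ratio hmax/hmin gives a quick indication of how strongly the step-size varies along the integration interval.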




Fig. 7.12. The first component $y_1$ of the solution of (7.42) computed using the
program ode15s. Note the non-uniform distribution of the integration nodes

We note that a variable step-size is adopted, which guarantees both
stability and the desired accuracy. Its choice is based on suitable error
indicators which can be associated with the specific method at hand (see
[Lam91], [SR97] and [DB02]).



7.10 Exercises 



Exercise 7.1 Apply the backward Euler and forward Euler methods for the
solution of the Cauchy problem

$$y' = \sin(t) + y, \quad t \in (0, 1], \quad \text{with } y(0) = 0, \qquad (7.43)$$

and verify that both converge with order 1.

Exercise 7.2 Consider the Cauchy problem

$$y' = -t e^{-y}, \quad t \in (0, 1], \quad \text{with } y(0) = 0. \qquad (7.44)$$

Apply the forward Euler method with h = 1/100 and estimate the number of
exact significant digits of the approximate solution at t = 1 (use the property
that the value of the exact solution is included between -1 and 0).

Exercise 7.3 The backward Euler method applied to problem (7.44) requires
at each step the solution of the non-linear equation
$u_{n+1} = u_n - h t_{n+1} e^{-u_{n+1}} = \phi(u_{n+1})$. The solution $u_{n+1}$ can be
obtained by the following fixed-point iteration: for k = 0, 1, ..., compute
$u_{n+1}^{(k+1)} = \phi(u_{n+1}^{(k)})$, with $u_{n+1}^{(0)} = u_n$. Find under which
restriction on h these iterations converge.

Exercise 7.4 Repeat Exercise 7.1 for the Crank-Nicolson method.

Exercise 7.5 Verify that the Crank-Nicolson method can be derived from the
following integral form of the Cauchy problem (7.4)

$$y(t) - y_0 = \int_{t_0}^{t} f(\tau, y(\tau))\, d\tau$$

provided that the integral is approximated by the trapezoidal formula (4.17).

Exercise 7.6 Consider the model problem (7.19) where $\lambda$ is now a complex
number with negative real part. Note that even in this case u(t) tends to 0
as t tends to infinity. We define the region of absolute stability of a numerical
method to be the set of numbers of the complex plane of the form $z = h\lambda$
for which the method is absolutely stable. Determine the region of absolute
stability associated with the forward Euler method.

Exercise 7.7 Solve the model problem (7.19) with $\lambda = -1 + i$ by the forward
Euler method and find the values of h for which we have absolute stability.

Exercise 7.8 Show that the Heun method defined in (7.39) is consistent.
Write a MATLAB program to implement it for the solution of the Cauchy
problem (7.43) and verify experimentally that the method has order of
convergence equal to 2 with respect to h.

Exercise 7.9 Prove that the Heun method (7.39) is absolutely stable if
$-2 < h\lambda < 0$ where $\lambda$ is real and negative.

Exercise 7.10 Prove formula (7.24). 

Exercise 7.11 Prove the inequality (7.29). 

Exercise 7.12 Prove the inequality (7.30). 

Exercise 7.13 Verify the consistency of the following method (which is an
explicit Runge-Kutta method of order 3)

$$u_{n+1} = u_n + \frac{1}{6}(k_1 + 4k_2 + k_3), \quad k_1 = h f(t_n, u_n),$$
$$k_2 = h f\left(t_n + \frac{h}{2}, u_n + \frac{k_1}{2}\right), \quad k_3 = h f(t_{n+1}, u_n + 2k_2 - k_1). \qquad (7.45)$$

Write a MATLAB program to implement it for the solution of the Cauchy
problem (7.43) and verify experimentally that the method has order of
convergence equal to 3 with respect to h. The methods (7.39) and (7.45) stand at
the base of the MATLAB program ode23 for the solution of ordinary differential
equations.

Exercise 7.14 Prove that the method (7.45) is absolutely stable if
$-2.5 < h\lambda < 0$ where $\lambda$ is real and negative.

Exercise 7.15 The modified Euler method is defined as follows:

$$u_{n+1}^* = u_n + h f(t_n, u_n), \quad u_{n+1} = u_n + h f(t_{n+1}, u_{n+1}^*). \qquad (7.46)$$

Find under which condition on h this method is absolutely stable.

Exercise 7.16 Solve equation (7.1) by the Crank-Nicolson method and the
Heun method when the body in question is a cube with side equal to 1 m
and mass equal to 1 Kg. Assume that $T_0 = 180$ K, $T_e = 200$ K, $\gamma = 0.5$ and
C = 100 J/(Kg/K). Compare the results obtained by using h = 20 and h = 10,
for t ranging from 0 to 200 seconds.

Exercise 7.17 Use MATLAB to compute the region of absolute stability of 
the Heun method. 

Exercise 7.18 Solve the Cauchy problem (7.12) by the Heun method and
verify its order.

Exercise 7.19 The displacement x(t) of a vibrating system represented by a
body of a given weight and a spring, subjected to a resistive force proportional
to the velocity, is described by the second-order differential equation
$x'' + 5x' + 6x = 0$. Solve it by the Heun method assuming that x(0) = 1 and
x'(0) = 0, for $t \in [0, 5]$.

Exercise 7.20 The motion of a frictionless Foucault pendulum is described
by the two equations

$$x'' - 2\omega \sin(\Psi)\, y' + k^2 x = 0, \quad y'' + 2\omega \cos(\Psi)\, x' + k^2 y = 0,$$

where $\Psi$ is the latitude of the place where the pendulum is located,
$\omega = 7.29 \cdot 10^{-5}$ sec$^{-1}$ is the angular velocity of the Earth, $k = \sqrt{g/l}$ with g = 9.8
m/sec$^2$ and l is the length of the pendulum. Apply the forward Euler method
to compute x = x(t) and y = y(t) for t ranging between 0 and 300 seconds
and $\Psi = \pi/4$.




8. Numerical methods for 
boundary-value problems 



Boundary-value problems are differential problems set in an interval
(a, b) of the real line or in an open multidimensional region $\Omega \subset \mathbb{R}^d$
(d = 2, 3) for which the value of the unknown solution (or its derivatives)
is prescribed at the end-points a and b of the interval, or on the
boundary $\partial\Omega$ of the multidimensional region.

In the multidimensional case the differential equation will involve partial
derivatives of the exact solution with respect to the space coordinates.
Some examples of boundary-value problems are reported below.

1. Poisson equation:

$$-u''(x) = f(x), \quad x \in (a, b), \qquad (8.1)$$

or (in several dimensions)

$$-\Delta u(\mathbf{x}) = f(\mathbf{x}), \quad \mathbf{x} \in \Omega, \qquad (8.2)$$

where f is a given function and $\Delta$ is the so-called Laplace operator:

$$\Delta u = \sum_{i=1}^{d} \frac{\partial^2 u}{\partial x_i^2}.$$

The symbol $\partial \cdot / \partial x_i$ denotes partial derivative with respect to the
$x_i$ variable, that is, for every point $\mathbf{x}^0$

$$\frac{\partial u}{\partial x_i}(\mathbf{x}^0) = \lim_{h \to 0} \frac{u(\mathbf{x}^0 + h\mathbf{e}_i) - u(\mathbf{x}^0)}{h}, \qquad (8.3)$$

where $\mathbf{e}_i$ is the i-th unit vector of $\mathbb{R}^d$.







2. Heat equation:

$$\frac{\partial u(x,t)}{\partial t} - \mu \frac{\partial^2 u(x,t)}{\partial x^2} = f(x,t), \quad x \in (a, b), \ t > 0,$$

or (in several dimensions)

$$\frac{\partial u(\mathbf{x},t)}{\partial t} - \mu \Delta u(\mathbf{x},t) = f(\mathbf{x},t), \quad \mathbf{x} \in \Omega, \ t > 0,$$

where t is the time variable, $\mu > 0$ is a given coefficient representing
the thermal conductivity, and f is again a given function.

3. Wave equation:

$$\frac{\partial^2 u(x,t)}{\partial t^2} - c \frac{\partial^2 u(x,t)}{\partial x^2} = 0, \quad x \in (a, b), \ t > 0,$$

or (in several dimensions)

$$\frac{\partial^2 u(\mathbf{x},t)}{\partial t^2} - c \Delta u(\mathbf{x},t) = 0, \quad \mathbf{x} \in \Omega, \ t > 0,$$

where c is a positive constant.

We will restrict our attention to equations like (8.1) and (8.2). Equations
depending on time, like the heat equation and the wave equation, are
called initial-boundary value problems. In that case an initial condition
$u = u_0$ at t = 0 needs to be prescribed as well. For initial-boundary
value problems and for more general partial differential equations, the
reader is referred for instance to [QV94], [EEHJ96] or [Lan03].



Problem 8.1 (Thermodynamics) If we are interested in the temperature
distribution in a square $\Omega$ with side of length L, we can compute the net energy
variation in each direction in an infinitesimal square with side of length $l \ll L$.
We have

$$\mathbf{J}(\mathbf{x}) - \mathbf{J}(\mathbf{x} + l\mathbf{e}_i) = -l\, \mathbf{e}_i \frac{\partial J}{\partial x_i}(\mathbf{x}),$$

where $\mathbf{J}(\mathbf{x})$ represents the energy transfer for unit time. The Fourier law states
that J is proportional to the variation of the temperature T, and then

$$\mathbf{J}(\mathbf{x}) - \mathbf{J}(\mathbf{x} + l\mathbf{e}_i) = -l\, \mathbf{e}_i \frac{\partial}{\partial x_i}\left( k l \frac{\partial T}{\partial x_i} \right) = -k l^2 \frac{\partial^2 T}{\partial x_i^2}\, \mathbf{e}_i,$$

where k is a positive constant expressing the thermal conductivity coefficient.
At equilibrium the sum of energy variations must vanish, and therefore

$$\Delta T(\mathbf{x}) = \sum_{i=1}^{2} \frac{\partial^2 T}{\partial x_i^2}(\mathbf{x}) = 0,$$

which is a Poisson problem with f = 0.







Problem 8.2 (Hydrogeology) The study of filtration in groundwater can
lead, in some cases, to an equation like (8.2). Consider a portion $\Omega$ occupied
by a porous medium (like ground or clay). According to the Darcy law, the
water velocity filtration $\mathbf{q} = (q_1, q_2, q_3)^T$ is equal to the variation of the water
level $\phi$ in the medium, precisely

$$\mathbf{q} = -K \left( \frac{\partial \phi}{\partial x_1}, \frac{\partial \phi}{\partial x_2}, \frac{\partial \phi}{\partial x_3} \right)^T, \qquad (8.4)$$

where K is the constant hydraulic conductivity of the porous medium. Assume
that the fluid density is constant; then the mass conservation principle yields
the equation $\mathrm{div}\, \mathbf{q} = 0$, where $\mathrm{div}\, \mathbf{q}$ is the divergence of the vector $\mathbf{q}$ and is
defined as

$$\mathrm{div}\, \mathbf{q} = \sum_{i=1}^{3} \frac{\partial q_i}{\partial x_i}.$$

Thanks to (8.4) we therefore find that $\phi$ satisfies the Poisson problem
$\Delta \phi = 0$ (see Exercise 8.9).

The previous Poisson equation (8.2) admits an infinite number of solutions.
With the aim of obtaining a unique solution we must impose
suitable conditions on the boundary $\partial\Omega$ of $\Omega$. One possibility is to
prescribe the value of u on $\partial\Omega$, that is

$$u(\mathbf{x}) = g(\mathbf{x}) \quad \text{for } \mathbf{x} \in \partial\Omega, \qquad (8.5)$$

where g is a given function.

Problem (8.2), supplemented by the boundary condition (8.5), is a
Dirichlet boundary-value problem, and is precisely the problem that we
will face in the next section. An alternative to (8.5) is to prescribe a value
for the partial derivative of u with respect to the normal direction to
the boundary $\partial\Omega$, in which case we will get a Neumann boundary-value
problem.

It can be proven that if f and g are two continuous functions and the
region $\Omega$ is regular enough, then the Dirichlet boundary-value problem
has a unique solution (while the solution of the Neumann boundary-value
problem is unique up to an additive constant).

In the one-dimensional case, the Dirichlet boundary-value problem
takes the following form: being given two constants $\alpha$ and $\beta$ and a
function f = f(x), find a function u = u(x) which satisfies

$$-u''(x) = f(x) \quad \text{for } x \in (a, b), \qquad u(a) = \alpha, \quad u(b) = \beta. \qquad (8.6)$$

Performing double integration it is easily seen that if $f \in C^0([a, b])$, the
solution u exists and is unique; moreover it belongs to $C^2([a, b])$.

Although (8.6) is an ordinary differential problem, it cannot be cast
in the form of a Cauchy problem for ordinary differential equations since
the value of u is prescribed at two different points.

The numerical methods which are used for its solution are based on
the same principles which form the basis of the approximation of
multidimensional boundary-value problems. This is the reason why in Section
8.1 we will make a digression on the numerical solution of problem (8.6).



8.1 Approximation of boundary-value problems 



We introduce on [a, b] a partition into intervals $I_j = [x_j, x_{j+1}]$ for
j = 0, ..., N with $x_0 = a$ and $x_{N+1} = b$. We assume for simplicity that all
intervals have the same length h.



8.1.1 Approximation by finite differences 



The differential equation must be satisfied in particular at any point $x_j$
(which we call nodes from now on) internal to (a, b), that is

$$-u''(x_j) = f(x_j), \quad j = 1, \ldots, N.$$

We can approximate this set of N equations by replacing the second
derivative with a suitable finite difference as we have done in Chapter 4
for the first derivatives. In particular, we observe that if $u : [a, b] \to \mathbb{R}$
is a sufficiently smooth function in a neighborhood of a generic point
$x \in (a, b)$, then the quantity

$$\delta^2 u(x) = \frac{u(x + h) - 2u(x) + u(x - h)}{h^2} \qquad (8.7)$$

provides an approximation to $u''(x)$ of order 2 with respect to h (see
Exercise 8.3). This suggests the use of the following approximation to
problem (8.6): find $\{u_j\}_{j=1}^{N}$ such that

$$-\frac{u_{j+1} - 2u_j + u_{j-1}}{h^2} = f(x_j), \quad j = 1, \ldots, N \qquad (8.8)$$

with $u_0 = \alpha$ and $u_{N+1} = \beta$. Equations (8.8) provide a linear system

$$A \mathbf{u}_h = h^2 \mathbf{f}, \qquad (8.9)$$

where $\mathbf{u}_h = (u_1, \ldots, u_N)^T$ is the vector of unknowns,
$\mathbf{f} = (f(x_1) + \alpha/h^2, f(x_2), \ldots, f(x_{N-1}), f(x_N) + \beta/h^2)^T$, and A is the
tridiagonal matrix

$$A = \begin{pmatrix} 2 & -1 & 0 & \cdots & 0 \\ -1 & 2 & \ddots & & \vdots \\ 0 & \ddots & \ddots & -1 & 0 \\ \vdots & & -1 & 2 & -1 \\ 0 & \cdots & 0 & -1 & 2 \end{pmatrix}. \qquad (8.10)$$



This system admits a unique solution since A is symmetric and positive
definite (see Exercise 8.1). Since A is tridiagonal, this system can
be solved by the Thomas algorithm introduced in Section 5.4. We note
however that, for small values of h (and thus for large values of N),
A is ill-conditioned. Indeed, $K(A) = \lambda_{max}(A)/\lambda_{min}(A) = C h^{-2}$, for a
suitable constant C independent of h (see Exercise 8.2). Consequently,
the numerical solution of system (8.9), by either direct or iterative
methods, requires special care. In particular, when using iterative methods a
suitable preconditioner ought to be employed.
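The growth of the condition number can be observed with a few lines of MATLAB. The sketch below is only an illustration under the assumption that the interval has unit length, so that h = 1/(N+1); the names Nvals, condA and ratios are introduced here. Halving h should multiply K(A) by a factor close to 4, consistently with the $h^{-2}$ behavior.

```matlab
Nvals = [9 19 39 79];            % h = 1/10, 1/20, 1/40, 1/80 on a unit interval
condA = zeros(size(Nvals));
for k = 1:numel(Nvals)
    N = Nvals(k);
    e = ones(N,1);
    A = spdiags([-e 2*e -e], -1:1, N, N);   % matrix (8.10)
    condA(k) = cond(full(A));               % 2-norm condition number
end
% ratio between condition numbers on consecutive grids
ratios = condA(2:end)./condA(1:end-1);
disp(ratios)
```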

It is possible to prove (see, e.g., [QSS00], Chapter 12) that if
$f \in C^2([a, b])$ then

$$\max_{j=0,\ldots,N+1} |u(x_j) - u_j| \le \frac{h^2}{96} \max_{x \in [a,b]} |f''(x)|,$$

that is, the finite difference method (8.8) converges with order 2 with
respect to h.
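The second-order convergence can be checked experimentally. The following sketch is an assumption-laden illustration, not one of the programs of the book: it takes the test problem $-u'' = \pi^2 \sin(\pi x)$ on (0, 1) with $\alpha = \beta = 0$, whose exact solution is $u(x) = \sin(\pi x)$, solves (8.9) on four grids and estimates the order from the errors on consecutive grids. The names uex, err, bound and p are introduced here.

```matlab
uex = @(x) sin(pi*x);            % exact solution of the test problem
f   = @(x) pi^2*sin(pi*x);       % corresponding right-hand side
Nvals = [9 19 39 79];
err = zeros(size(Nvals));
for k = 1:numel(Nvals)
    N = Nvals(k); h = 1/(N+1);
    x = (1:N)'*h;                            % internal nodes
    e = ones(N,1);
    A = spdiags([-e 2*e -e], -1:1, N, N);    % matrix (8.10)
    uh = A\(h^2*f(x));                       % system (8.9) with alpha = beta = 0
    err(k) = max(abs(uh - uex(x)));          % maximum nodal error
end
% estimated order of convergence with respect to h
p = log(err(1:end-1)./err(2:end))/log(2);
% right-hand side of the error estimate: max|f''| = pi^4 for this problem
bound = (1./(Nvals+1)).^2/96*pi^4;
disp(p)
```

The entries of p should be close to 2, and each entry of err should stay below the corresponding entry of bound.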

In Program 17 we solve the boundary-value problem

$$-u''(x) + \delta u'(x) + \gamma u(x) = f(x) \quad \text{for } x \in (a, b), \qquad u(a) = \alpha, \quad u(b) = \beta, \qquad (8.11)$$

which is a generalization of problem (8.6). For this problem the finite
difference method, which generalizes (8.8), reads:

$$-\frac{u_{j+1} - 2u_j + u_{j-1}}{h^2} + \delta \frac{u_{j+1} - u_{j-1}}{2h} + \gamma u_j = f(x_j), \quad j = 1, \ldots, N,$$
$$u_0 = \alpha, \quad u_{N+1} = \beta.$$

The input parameters of Program 17 are the end-points a and b of the
interval, the number N of internal nodes, the constant coefficients $\delta$ and
$\gamma$ and the string bvpfun. Finally, ua and ub represent the values that
the solution should attain at x=a and x=b. Output parameters are the
vector of nodes x and the computed solution uh.



Program 17 - bvp: Approximation of a two-point boundary-value
problem by the finite difference method

function [x,uh]=bvp(a,b,N,delta,gamma,bvpfun,ua,ub,varargin)
%BVP Solve two-point boundary value problems.
% [X,UH]=BVP(A,B,N,DELTA,GAMMA,BVPFUN,UA,UB) solves with the
% centered finite difference method the boundary-value
% problem
% -D(DU/DX)/DX+DELTA*DU/DX+GAMMA*U=BVPFUN
% on the interval (A,B) with boundary conditions U(A)=UA
% and U(B)=UB.
% BVPFUN can be an inline function.
h = (b-a)/(N+1);                 % uniform grid spacing
z = linspace(a,b,N+2);           % all nodes, end-points included
e = ones(N,1);
% tridiagonal matrix of the scheme, multiplied by h^2
A = spdiags([-e-0.5*h*delta 2*e+gamma*h^2 -e+0.5*h*delta],-1:1,N,N);
x = z(2:end-1);                  % internal nodes
f = h^2*feval(bvpfun,x,varargin{:});
f = f';                          % column vector
% move the known boundary values to the right-hand side; the delta
% terms come from the centered approximation of the first derivative
f(1) = f(1) + (1+0.5*h*delta)*ua;
f(end) = f(end) + (1-0.5*h*delta)*ub;
uh = A\f;
uh = [ua; uh; ub];               % add the boundary values
x = z;
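The scheme for (8.11) can also be tried out without calling Program 17. The sketch below is a stand-alone illustration under stated assumptions: the coefficients delta = 1 and gamma = 2, the interval (0, 1) and the manufactured solution $u(x) = e^x$ (so that $f(x) = (-1 + \delta + \gamma) e^x$ and the boundary values are non-zero) are all choices made here, as are the names uex, rhs, err and p.

```matlab
a = 0; b = 1; delta = 1; gamma = 2;
uex = @(x) exp(x);                           % manufactured exact solution
f   = @(x) (-1 + delta + gamma)*exp(x);      % -u'' + delta*u' + gamma*u
ua = uex(a); ub = uex(b);                    % Dirichlet data
Nvals = [9 19 39];
err = zeros(size(Nvals));
for k = 1:numel(Nvals)
    N = Nvals(k); h = (b-a)/(N+1);
    x = a + (1:N)'*h;                        % internal nodes
    e = ones(N,1);
    % centered scheme multiplied by h^2: sub-, main and super-diagonal
    A = spdiags([(-1-0.5*h*delta)*e (2+gamma*h^2)*e (-1+0.5*h*delta)*e], ...
                -1:1, N, N);
    rhs = h^2*f(x);
    rhs(1)   = rhs(1)   + (1+0.5*h*delta)*ua;   % contribution of u_0
    rhs(end) = rhs(end) + (1-0.5*h*delta)*ub;   % contribution of u_{N+1}
    uh = A\rhs;
    err(k) = max(abs(uh - uex(x)));
end
p = log(err(1:end-1)./err(2:end))/log(2);    % estimated order
disp(p)
```

Since both difference quotients are centered, the estimated order should be close to 2.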



8.1.2 Approximation by finite elements 



The finite element method represents an alternative to the previous finite
difference method. It is derived from a suitable reformulation of problem
(8.6).

Let us multiply both sides of (8.6) by a generic function v. Integrating
the corresponding equality on the interval (a, b), we obtain

$$-\int_a^b u''(x) v(x)\, dx = \int_a^b f(x) v(x)\, dx. \qquad (8.12)$$

If we assume that $v \in C^1([a, b])$ and use integration by parts we obtain

$$\int_a^b u'(x) v'(x)\, dx - [u'(x) v(x)]_a^b = \int_a^b f(x) v(x)\, dx.$$

By making the further assumption that v vanishes at the end-points
x = a and x = b, problem (8.6) becomes: find u such that $u(a) = \alpha$,
$u(b) = \beta$ and

$$\int_a^b u'(x) v'(x)\, dx = \int_a^b f(x) v(x)\, dx \qquad (8.13)$$

for each $v \in C^1([a, b])$ such that v(a) = v(b) = 0. This is called weak
formulation of problem (8.6). (Indeed, the test functions v can be less
regular than $C^1([a, b])$, see, e.g. [QSS00].)

Its finite element approximation is defined as follows:

find $u_h \in V_h$ such that $u_h(a) = \alpha$, $u_h(b) = \beta$ and

$$\sum_{j=0}^{N} \int_{x_j}^{x_{j+1}} u_h'(x) v_h'(x)\, dx = \int_a^b f(x) v_h(x)\, dx \qquad \forall v_h \in V_h^0, \qquad (8.14)$$

where

$$V_h = \{ v_h \in C^0([a, b]) : \ v_h|_{I_j} \in \mathbb{P}_1, \ j = 0, \ldots, N \},$$

i.e. $V_h$ is the space of continuous functions on (a, b) whose restrictions
on every sub-interval $I_j$ are linear polynomials. Moreover, $V_h^0$ is the
subspace of $V_h$ of those functions vanishing at the end-points a and b. $V_h$ is
called space of finite elements of degree 1.

Fig. 8.1. To the left, a generic function $v_h \in V_h^0$. To the right, the basis
function of $V_h^0$ associated with the k-th node

The functions in $V_h^0$ are piecewise linear polynomials (see Figure 8.1,
left). In particular, every function $v_h$ of $V_h^0$ admits the representation

$$v_h(x) = \sum_{j=1}^{N} v_h(x_j) \varphi_j(x),$$

where

$$\varphi_j(x) = \begin{cases} \dfrac{x - x_{j-1}}{x_j - x_{j-1}} & \text{if } x \in I_{j-1}, \\ \dfrac{x - x_{j+1}}{x_j - x_{j+1}} & \text{if } x \in I_j, \\ 0 & \text{otherwise,} \end{cases}$$

for j = 1, ..., N. Thus, $\varphi_k$ is null at every node $x_j$ except at $x_k$ where
$\varphi_k(x_k) = 1$ (see Figure 8.1, right). The functions $\varphi_j$, j = 1, ..., N are
called shape functions and provide a basis for the vector space $V_h^0$.

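Consequently, we can limit ourselves to fulfill (8.14) only for the shape
functions $\varphi_j$, j = 1, ..., N. By exploiting the fact that $\varphi_j$ vanishes
outside the intervals $I_{j-1}$ and $I_j$, from (8.14) we obtain

$$\int_{I_{j-1} \cup I_j} u_h'(x) \varphi_j'(x)\, dx = \int_{I_{j-1} \cup I_j} f(x) \varphi_j(x)\, dx, \quad j = 1, \ldots, N. \qquad (8.15)$$

On the other hand, we can write $u_h(x) = \sum_{j=1}^{N} u_j \varphi_j(x) + \alpha \varphi_0(x) + \beta \varphi_{N+1}(x)$,
where $u_j = u_h(x_j)$, $\varphi_0(x) = (a + h - x)/h$ for $a \le x \le a + h$,
and $\varphi_{N+1}(x) = (x - b + h)/h$ for $b - h \le x \le b$, while both $\varphi_0(x)$ and
$\varphi_{N+1}(x)$ are zero otherwise. By substituting this expression in (8.15),
we find that for all j = 1, ..., N

$$u_{j-1} \int_{I_{j-1}} \varphi_{j-1}'(x) \varphi_j'(x)\, dx + u_j \int_{I_{j-1} \cup I_j} \varphi_j'(x) \varphi_j'(x)\, dx + u_{j+1} \int_{I_j} \varphi_{j+1}'(x) \varphi_j'(x)\, dx = \int_{I_{j-1} \cup I_j} f(x) \varphi_j(x)\, dx + B_{1,j} + B_{N,j}$$

(the terms in $u_0$ for j = 1 and in $u_{N+1}$ for j = N are not present on the
left-hand side, since the boundary values are accounted for by $B_{1,j}$ and $B_{N,j}$),
where

$$B_{1,j} = \begin{cases} -\alpha \displaystyle\int_{I_0} \varphi_0'(x) \varphi_1'(x)\, dx = \dfrac{\alpha}{x_1 - a} & \text{if } j = 1, \\ 0 & \text{otherwise,} \end{cases}$$

while

$$B_{N,j} = \begin{cases} -\beta \displaystyle\int_{I_N} \varphi_{N+1}'(x) \varphi_N'(x)\, dx = \dfrac{\beta}{b - x_N} & \text{if } j = N, \\ 0 & \text{otherwise.} \end{cases}$$

In the special case where all intervals have the same length h, then
$\varphi_{j-1}' = -1/h$ in $I_{j-1}$, $\varphi_j' = 1/h$ in $I_{j-1}$ and $\varphi_j' = -1/h$ in $I_j$,
$\varphi_{j+1}' = 1/h$ in $I_j$. Consequently, after multiplication by h, we obtain for
j = 1, ..., N

$$-u_{j-1} + 2u_j - u_{j+1} = h \int_{I_{j-1} \cup I_j} f(x) \varphi_j(x)\, dx + h \left( B_{1,j} + B_{N,j} \right).$$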

This linear system has the same matrix as the finite difference system 
(8.9), but a different right-hand side (and a different solution too, in 
spite of coincidence of notation). Finite difference and finite element 
solutions share however the same accuracy with respect to h when the 
nodal maximum error is computed. 
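To see the "same matrix, different right-hand side" remark at work, the sketch below solves $-u'' = \pi^2 \sin(\pi x)$ on (0, 1) with $\alpha = \beta = 0$ by both approaches. It is an illustration under assumptions made here: the test problem, the use of Simpson's rule on each of the intervals $I_{j-1}$ and $I_j$ to approximate the integral of $f \varphi_j$, and the names ufd, ufe, err_fd and err_fe. Using the trapezoidal rule instead would give $h f(x_j)$ for the integral and hence reproduce the finite difference system exactly.

```matlab
uex = @(x) sin(pi*x);
f   = @(x) pi^2*sin(pi*x);
N = 19; h = 1/(N+1);
x = (1:N)'*h;                                % internal nodes
e = ones(N,1);
A = spdiags([-e 2*e -e], -1:1, N, N);        % same matrix for both methods
% finite differences: right-hand side h^2*f(x_j)
ufd = A\(h^2*f(x));
% finite elements: integral of f*phi_j over I_{j-1} and I_j by Simpson's
% rule; phi_j equals 1 at x_j, 1/2 at the two midpoints, 0 at x_{j-1}, x_{j+1}
b = (h/3)*(f(x - h/2) + f(x) + f(x + h/2));
ufe = A\(h*b);
err_fd = max(abs(ufd - uex(x)));
err_fe = max(abs(ufe - uex(x)));
fprintf('nodal error: FD %.2e, FE %.2e\n', err_fd, err_fe);
```

For this particular one-dimensional model problem the finite element nodal values turn out to be considerably more accurate than the finite difference ones; this is a special feature of the example with an accurate quadrature and should not be read as a general statement about the two methods.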

Obviously the finite element approach can be generalized to problems
like (8.11) (also in the case when $\delta$ and $\gamma$ depend on x). A further
generalization consists of using piecewise polynomials of degree greater than
1, allowing the achievement of higher convergence orders. In these cases,
the finite element matrix does not coincide anymore with that of finite
differences, and the convergence order is greater than when using
piecewise linear polynomials.

See Exercises 8.1-8.8.



8.2 Finite differences in 2 dimensions 



Let us consider a partial differential equation, for instance equation (8.2),
in a two-dimensional region $\Omega$.

The idea behind finite differences relies on approximating the partial
derivatives that are present in the PDE again by incremental ratios
computed on a suitable grid (called the computational grid) made of a finite
number of nodes. Then the solution u of the PDE will be approximated
only at these nodes.

The first step therefore consists of introducing a computational grid.
Assume for simplicity that $\Omega$ is the rectangle $(a, b) \times (c, d)$. Let us
introduce a partition of [a, b] in subintervals $(x_k, x_{k+1})$ for $k = 0, \ldots, N_x$,
with $x_0 = a$ and $x_{N_x+1} = b$. Let us denote by $\Delta_x = \{x_0, \ldots, x_{N_x+1}\}$ the
set of end points of such intervals and by $h_x = \max_{k=0,\ldots,N_x} (x_{k+1} - x_k)$ their
maximum length.

In a similar manner we introduce a discretization of the y-axis
$\Delta_y = \{y_0, \ldots, y_{N_y+1}\}$ with $y_0 = c$ and $y_{N_y+1} = d$. The cartesian product
$\Delta_h = \Delta_x \times \Delta_y$ provides the computational grid on $\Omega$ (see Figure 8.2),
and $h = \max\{h_x, h_y\}$ is a characteristic measure of the grid-size. We
are looking for values $u_{i,j}$ which approximate $u(x_i, y_j)$. We will assume
for the sake of simplicity that the nodes be uniformly spaced, that is,
$x_i = x_0 + i h_x$ for $i = 0, \ldots, N_x + 1$ and $y_j = y_0 + j h_y$ for $j = 0, \ldots, N_y + 1$.

Fig. 8.2. The computational grid $\Delta_h$ with only 15 internal nodes on a
rectangular domain



The second order partial derivatives of a function can be approximated
by a suitable incremental ratio, as we did for ordinary derivatives. In the
case of a function of 2 variables, we define the following incremental
ratios:

$$\delta_x^2 u_{i,j} = \frac{u_{i-1,j} - 2u_{i,j} + u_{i+1,j}}{h_x^2}, \qquad \delta_y^2 u_{i,j} = \frac{u_{i,j-1} - 2u_{i,j} + u_{i,j+1}}{h_y^2}. \qquad (8.16)$$

They are second order accurate with respect to $h_x$ and $h_y$, respectively,
for the approximation of $\partial^2 u/\partial x^2$ and $\partial^2 u/\partial y^2$ at the node $(x_i, y_j)$. If
we replace the second order partial derivatives of u with the formula
(8.16), by requiring that the PDE is satisfied at all internal nodes of $\Delta_h$,
we obtain the following set of equations:

$$-(\delta_x^2 u_{i,j} + \delta_y^2 u_{i,j}) = f_{i,j}, \quad i = 1, \ldots, N_x, \ j = 1, \ldots, N_y. \qquad (8.17)$$

We have set $f_{i,j} = f(x_i, y_j)$. We must add the equations that enforce
the Dirichlet data at the boundary, which are

$$u_{i,j} = g_{i,j} \quad \forall i, j \text{ such that } (x_i, y_j) \in \partial\Delta_h, \qquad (8.18)$$

where $\partial\Delta_h$ indicates the set of nodes belonging to the boundary $\partial\Omega$ of
$\Omega$. These nodes are indicated by small squares in Figure 8.2. If we make
the further assumption that the computational grid is uniform in both
cartesian directions, that is, $h_x = h_y = h$, instead of (8.17) we obtain

$$-\frac{1}{h^2} \left( u_{i-1,j} + u_{i,j-1} - 4u_{i,j} + u_{i,j+1} + u_{i+1,j} \right) = f_{i,j}, \quad i = 1, \ldots, N_x, \ j = 1, \ldots, N_y. \qquad (8.19)$$

The system given by equations (8.19) (or (8.17)) and (8.18) allows the
computation of the nodal values $u_{i,j}$ at all nodes of $\Delta_h$. For every fixed
pair of indices i and j, equation (8.19) involves 5 unknown nodal values
as we can see in Figure 8.3. For that reason this finite difference scheme
is called the 5 point scheme for the Laplace operator. We note that
the unknowns associated with the boundary nodes can be eliminated
using (8.18), and therefore (8.19) (or (8.17)) involves only $N = N_x N_y$
unknowns.



Fig. 8.3. The stencil of the 5 point scheme for the Laplace operator

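The resulting system can be written in a more interesting form if
we adopt the lexicographic order according to which the nodes (and,
correspondingly, the unknown components) are numbered by proceeding
from left to right, from the top to the bottom. We obtain a system of
the form (8.9), with a matrix $A \in \mathbb{R}^{N \times N}$ which takes the following block
tridiagonal form:

$$A = \begin{pmatrix} T & D & 0 & \cdots & 0 \\ D & T & D & & \vdots \\ 0 & \ddots & \ddots & \ddots & 0 \\ \vdots & & D & T & D \\ 0 & \cdots & 0 & D & T \end{pmatrix}.$$

There are $N_y$ rows and $N_y$ columns, and every entry (denoted by a
capital letter) consists of a $N_x \times N_x$ matrix. In particular, $D \in \mathbb{R}^{N_x \times N_x}$ is
a diagonal matrix whose diagonal entries are $-1/h_y^2$, while $T \in \mathbb{R}^{N_x \times N_x}$
is a symmetric tridiagonal matrix

$$T = \begin{pmatrix} \frac{2}{h_x^2} + \frac{2}{h_y^2} & -\frac{1}{h_x^2} & & 0 \\ -\frac{1}{h_x^2} & \frac{2}{h_x^2} + \frac{2}{h_y^2} & \ddots & \\ & \ddots & \ddots & -\frac{1}{h_x^2} \\ 0 & & -\frac{1}{h_x^2} & \frac{2}{h_x^2} + \frac{2}{h_y^2} \end{pmatrix}.$$

A is symmetric since all diagonal blocks are symmetric. It is also positive
definite, that is $\mathbf{v}^T A \mathbf{v} > 0$ $\forall \mathbf{v} \in \mathbb{R}^N$, $\mathbf{v} \ne \mathbf{0}$. Actually, by partitioning $\mathbf{v}$
in $N_y$ vectors $\mathbf{v}_k$ of length $N_x$ we obtain

$$\mathbf{v}^T A \mathbf{v} = \sum_{k=1}^{N_y} \mathbf{v}_k^T T \mathbf{v}_k - \frac{2}{h_y^2} \sum_{k=1}^{N_y - 1} \mathbf{v}_k^T \mathbf{v}_{k+1}. \qquad (8.20)$$

We can write $T = (2/h_y^2) I + (1/h_x^2) K$ where K is the (symmetric and
positive definite) matrix given in (8.10). Consequently, (8.20) becomes

$$\mathbf{v}^T A \mathbf{v} = \frac{1}{h_x^2} \sum_{k=1}^{N_y} \mathbf{v}_k^T K \mathbf{v}_k + \frac{1}{h_y^2} \left( \mathbf{v}_1^T \mathbf{v}_1 + \mathbf{v}_{N_y}^T \mathbf{v}_{N_y} + \sum_{k=1}^{N_y - 1} (\mathbf{v}_k - \mathbf{v}_{k+1})^T (\mathbf{v}_k - \mathbf{v}_{k+1}) \right) \ge \frac{\mathbf{v}_1^T K \mathbf{v}_1 + \mathbf{v}_2^T K \mathbf{v}_2 + \ldots + \mathbf{v}_{N_y}^T K \mathbf{v}_{N_y}}{h_x^2},$$

which is a strictly positive real number since K is positive definite and
at least one vector $\mathbf{v}_k$ is non null.

Having proven that A is non-singular we can conclude that the finite
difference system admits a unique solution $\mathbf{u}_h$.

Fig. 8.4. Pattern of the matrix associated with the 5-point scheme using the
lexicographic ordering of the unknowns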



The matrix A is sparse since the number of non-null elements is much
smaller than that of null elements. (In Figure 8.4 we report the structure of
the matrix corresponding to a uniform grid of 11 x 11 nodes. This picture
was obtained by using the command spy(A). It can be noted that the
only nonzero elements lie on 5 diagonals).
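The structure just described can be reproduced with a short script. The sketch below is an illustration under assumptions made here: a unit square with $h_x = h_y = h$, 9 x 9 internal nodes (that is, an 11 x 11 grid including the boundary), and an assembly based on Kronecker products rather than on the loops of Program 18. The names K1, offsets, issym and notpd are introduced for the example.

```matlab
n = 9; h = 1/(n+1);                      % 9 x 9 internal nodes
e = ones(n,1);
K1 = spdiags([-e 2*e -e], -1:1, n, n);   % one-dimensional matrix (8.10)
I = speye(n);
% block tridiagonal matrix of the 5 point scheme (lexicographic ordering)
A = (kron(I,K1) + kron(K1,I))/h^2;
[i, j] = find(A);
offsets = unique(j - i)';                % diagonals that hold nonzeros
issym = isequal(A, A');                  % symmetry check
[~, notpd] = chol(A);                    % notpd == 0 if A is positive definite
fprintf('nnz = %d out of %d entries\n', nnz(A), numel(A));
% spy(A) would display the sparsity pattern
```

With n internal nodes per direction the matrix has $5n^2 - 4n$ nonzero entries, located on the diagonals 0, ±1 and ±n.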

A sparse matrix can be stored in a special mode that allows an easy
access to the nonzero elements and preserves only these elements. The
command sparse that we use in Program 18 will serve this purpose.
Since A is symmetric and positive definite, the associated system can be
solved efficiently by either direct or iterative methods, as illustrated in
Chapter 5. Finally, it is worth pointing out that A shares with its
one-dimensional analog the property of being ill-conditioned: indeed, its
condition number grows like $h^{-2}$ as h tends to zero, where $h = \max(h_x, h_y)$.

In Program 18 we construct and solve the system (8.17)-(8.18).
The input parameters a, c, b and d denote the corners (a, c) and (b, d) of the
rectangular domain $\Omega = (a, b) \times (c, d)$, while nx and ny denote the values of $N_x$
and $N_y$ (the case $N_x \ne N_y$ is admitted). Finally, the two strings fun
and bound represent the right-hand side f = f(x, y) (otherwise called
the source term) and the boundary data g = g(x, y). The output is a
two-dimensional array u whose i, j-th entry is the nodal value $u_{i,j}$. The
numerical solution can be visualized by the command mesh(x,y,u). The
(optional) string uex represents the exact solution of the original problem
for those cases (of theoretical interest) where this solution is known. In
such cases the output parameter error contains the nodal relative error
between the exact and numerical solution, which is computed as follows:

$$\text{error} = \max_{i,j} |u(x_i, y_j) - u_{i,j}| \, / \, \max_{i,j} |u(x_i, y_j)|.$$



Program 18 - poissonfd: Approximation of the Poisson problem
with Dirichlet data by the five-point finite difference method

function [u,x,y,error]=poissonfd(a,c,b,d,nx,ny,fun,bound,uex,varargin)
%POISSONFD two-dimensional Poisson solver
% [U,X,Y]=POISSONFD(A,C,B,D,NX,NY,FUN,BOUND) solves by
% the 5-points finite difference scheme the problem
% -LAPL(U) = FUN in the rectangle (A,B)X(C,D) with
% Dirichlet boundary conditions U(X,Y)=BOUND(X,Y) for any
% (X,Y) on the boundary of the rectangle.
%
% [U,X,Y,ERROR]=POISSONFD(A,C,B,D,NX,NY,FUN,BOUND,UEX) computes
% also the maximum nodal error ERROR with respect to the exact
% solution UEX.
% FUN, BOUND and UEX can be inline functions.
if nargin == 8
    uex = inline('0','x','y');
end
nx=nx+1; ny=ny+1;                 % number of subintervals in x and y
hx = (b-a)/nx; hy = (d-c)/ny;
nx1 = nx+1; hx2 = hx^2;
hy2 = hy^2;
kii = 2/hx2+2/hy2; kix = -1/hx2; kiy = -1/hy2;
dim = (nx+1)*(ny+1);              % all nodes, boundary included
K = speye(dim,dim);
rhs = zeros(dim,1);
y = c;
% rows of the 5 point scheme at the internal nodes
for m = 2:ny
    x = a; y = y + hy;
    for n = 2:nx
        i = n+(m-1)*(nx+1);
        x = x + hx;
        rhs(i) = feval(fun,x,y,varargin{:});
        K(i,i) = kii;
        K(i,i-1) = kix;
        K(i,i+1) = kix;
        K(i,i+nx1) = kiy;
        K(i,i-nx1) = kiy;
    end
end
% Dirichlet data on the four sides
rhs1 = zeros(dim,1);
x = [a:hx:b];
rhs1(1:nx1) = feval(bound,x,c,varargin{:});
rhs1(dim-nx:dim) = feval(bound,x,d,varargin{:});
y = [c:hy:d];
rhs1(1:nx1:dim-nx) = feval(bound,a,y,varargin{:});
rhs1(nx1:nx1:dim) = feval(bound,b,y,varargin{:});
rhs = rhs - K*rhs1;
% elimination of the boundary unknowns
nbound = [[1:nx1],[dim-nx:dim],[1:nx1:dim-nx],[nx1:nx1:dim]];
ninternal = setdiff([1:dim],nbound);
K = K(ninternal,ninternal);
rhs = rhs(ninternal);
utemp = K\rhs;
uh = rhs1;
uh(ninternal) = utemp;
% nodal values stored in a two-dimensional array
k = 1; y = c;
for j = 1:ny+1
    x = a;
    for i = 1:nx1
        u(i,j) = uh(k);
        k = k + 1;
        ue(i,j) = feval(uex,x,y,varargin{:});
        x = x + hx;
    end
    y = y + hy;
end
x = [a:hx:b];
y = [c:hy:d];
if nargout == 4 & nargin == 8
    warning('Exact solution not available');
    error = [ ];
else
    error = max(max(abs(u-ue)))/max(max(abs(ue)));
end
return



Example 8.1 The transverse displacement u of an elastic membrane from a
reference plane $\Omega = (0, 1)^2$ under a load whose intensity is
$f(x, y) = 8\pi^2 \sin(2\pi x) \cos(2\pi y)$ satisfies a Poisson problem like (8.2) in the
domain $\Omega$. The Dirichlet value of the displacement is prescribed on $\partial\Omega$ as
follows: g = 0 on the sides x = 0 and x = 1, and
$g(x, 0) = g(x, 1) = \sin(2\pi x)$, 0 < x < 1. This problem
admits the exact solution $u(x, y) = \sin(2\pi x) \cos(2\pi y)$. In Figure 8.5 we show
the numerical solution obtained by the five-point finite difference scheme on
a uniform grid. Two different values of h have been used: h = 1/10 (left) and
h = 1/20 (right). When h decreases the numerical solution improves, and
actually the nodal relative error is 0.0292 for h = 1/10 and 0.0081 for h = 1/20.

Fig. 8.5. Transverse displacement of an elastic membrane computed on two
uniform grids. On the horizontal plane we report the isolines of the numerical
solution. The triangular partition of $\Omega$ only serves the purpose of the
visualization of the results

Also the finite element method can be easily extended to the
two-dimensional case. To this end the problem (8.2) must be reformulated in
an integral form and the partition of the interval (a, b) in one dimension
must be replaced by a decomposition of $\Omega$ by polygons (typically,
triangles) called elements. The shape function $\varphi_k$ will still be a continuous
function, whose restriction on each element is a polynomial of degree 1,
which is equal to 1 at the k-th vertex (or node) of the
triangulation and 0 at all other vertices. For its implementation one can
use the MATLAB toolbox pde.



8.2.1 Consistency and convergence 



In the previous section we have shown that the solution of the finite
difference problem exists and is unique. Now we investigate the
approximation error. We will assume for simplicity that $h_x = h_y = h$. If

$$\max_{i,j} |u(x_i, y_j) - u_{i,j}| \to 0 \quad \text{as } h \to 0,$$

the method is called convergent.

As we have already pointed out, consistency is a necessary condition
for convergence. A method is consistent if the residual that is obtained
when the exact solution is plugged into the numerical scheme tends to
zero when h tends to zero. If we consider the five point finite difference
scheme, at every internal node $(x_i, y_j)$ of $\Delta_h$ we define

$$\tau_h(x_i, y_j) = -f(x_i, y_j) - \frac{u(x_{i-1}, y_j) + u(x_i, y_{j-1}) - 4u(x_i, y_j) + u(x_i, y_{j+1}) + u(x_{i+1}, y_j)}{h^2}.$$

This is the local truncation error at the node $(x_i, y_j)$. By (8.2) we obtain

$$\tau_h(x_i, y_j) = \left\{ \frac{\partial^2 u}{\partial x^2}(x_i, y_j) - \frac{u(x_{i-1}, y_j) - 2u(x_i, y_j) + u(x_{i+1}, y_j)}{h^2} \right\} + \left\{ \frac{\partial^2 u}{\partial y^2}(x_i, y_j) - \frac{u(x_i, y_{j-1}) - 2u(x_i, y_j) + u(x_i, y_{j+1})}{h^2} \right\}.$$

Thanks to the analysis that was carried out in Section 8.2 we can
conclude that both terms vanish as h tends to 0. Thus

$$\lim_{h \to 0} \tau_h(x_i, y_j) = 0, \quad \forall (x_i, y_j) \in \Delta_h \setminus \partial\Delta_h,$$

that is, the five-point method is consistent. It is also convergent, as stated
in the following Proposition (for its proof, see, e.g., [IK66]):



Proposition 8.1 Assume that the exact solution $u \in C^4(\bar\Omega)$, i.e. all
its partial derivatives up to the fourth order are continuous in the closed
domain $\bar\Omega$. Then there exists a positive constant $C$ such that

\[ \max_{i,j} |u(x_i,y_j) - u_{i,j}| \le C M h^2, \]

where $M$ is the maximum absolute value attained by the fourth order
derivatives of $u$ in $\bar\Omega$.

Example 8.2 Let us verify that the five-point scheme applied to the Poisson
problem of Example 8.1 converges with order 2 with respect to $h$. We start
from $h = 1/4$ and then repeatedly halve the value of $h$, until $h = 1/64$,
through the following instructions:

>> a=0; b=1; c=0; d=1;
>> f=inline('8*pi^2*sin(2*pi*x).*cos(2*pi*y)','x','y');
>> g=inline('sin(2*pi*x).*cos(2*pi*y)','x','y'); uex=g; nx=4; ny=4;
>> for n=1:5
[u,x,y,error(n)]=poissonfd(a,c,b,d,nx,ny,f,g,uex); nx = 2*nx; ny = 2*ny;
end

The vector containing the error is

>> format short e; error
1.3565e-01 4.3393e-02 1.2308e-02 3.2775e-03 8.4557e-04

As we can verify using the following commands

>> log(abs(error(1:end-1)./error(2:end)))/log(2)
1.6443e+00 1.8179e+00 1.9089e+00 1.9546e+00

this error decreases as $h^2$.
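The experiment of Example 8.2 can also be sketched without the program poissonfd. The script below is only an illustration under stated assumptions: it builds the five-point matrix on the unit square with `kron`, takes the Dirichlet data from the exact solution, and measures the error in a relative maximum norm. That norm is our choice, so the error values need not coincide with those reported above; only the observed order is expected to approach 2.

```matlab
% Sketch (not the book's poissonfd): five-point scheme for -Lap(u) = f on (0,1)^2
uex = @(x,y) sin(2*pi*x).*cos(2*pi*y);           % exact solution of Example 8.1
f   = @(x,y) 8*pi^2*sin(2*pi*x).*cos(2*pi*y);    % corresponding right-hand side
err = zeros(1,4);
N = 4;                                           % number of subintervals per side
for k = 1:4
    h = 1/N;
    [X,Y] = meshgrid(0:h:1, 0:h:1);              % rows <-> y, columns <-> x
    U = uex(X,Y);                                % exact nodal values (used on the boundary)
    m = N-1;                                     % interior nodes per direction
    e = ones(m,1);
    T = spdiags([-e 2*e -e], -1:1, m, m);        % 1D second-difference matrix
    A = (kron(speye(m),T) + kron(T,speye(m)))/h^2;
    F = f(X(2:N,2:N), Y(2:N,2:N));
    % move the known boundary values to the right-hand side
    F(1,:)   = F(1,:)   + U(1,2:N)/h^2;          % neighbours on y = 0
    F(end,:) = F(end,:) + U(N+1,2:N)/h^2;        % neighbours on y = 1
    F(:,1)   = F(:,1)   + U(2:N,1)/h^2;          % neighbours on x = 0
    F(:,end) = F(:,end) + U(2:N,N+1)/h^2;        % neighbours on x = 1
    Uint = reshape(A\F(:), m, m);                % numerical solution at interior nodes
    Uexi = U(2:N,2:N);
    err(k) = max(abs(Uint(:)-Uexi(:)))/max(abs(Uexi(:)));
    N = 2*N;
end
order = log(err(1:end-1)./err(2:end))/log(2);    % observed order of convergence
```

The vector `order` plays the same role as the output of the last command of Example 8.2.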




Let us summarize 




1. Boundary-value problems are differential equations set in a spatial
domain $\Omega \subset \mathbb{R}^d$ (which is an interval if $d = 1$) that require
information on the solution on the domain boundary;

2. finite difference approximations are based on the discretization of 
the given differential equation at selected points (called nodes) 
where derivatives are replaced by finite difference formulae; 

3. the finite difference method provides a nodal vector whose com- 
ponents converge to the corresponding nodal values of the exact 
solution quadratically with respect to the grid-size; 

4. the finite element method is based on a suitable integral reformu- 
lation of the original differential equation, then on the assumption 
that the approximate solution is a piecewise polynomial; 

5. matrices arising from both finite difference and finite element ap- 
proximations are sparse and ill-conditioned. 



See Exercises 8.9-8.10. 



8.3 What we haven't told you 



We could simply say that we have told you almost nothing, since the field
of numerical analysis which is devoted to the numerical approximation
of partial differential equations is so broad and multifaceted as to deserve
an entire monograph simply for addressing the most essential concepts
(see, e.g., [TW98], [EEHJ96]).

We would like to mention that the finite element method is nowadays
probably the most widely used method for the numerical solution
of partial differential equations (see, e.g., [QV94], [Bra97], [BS01]). As
already mentioned, the MATLAB toolbox pdetool allows the solution of a
broad family of partial differential equations by the linear finite element
method.

Other popular techniques are the spectral methods (see [CHQZ88],
[Fun92], [BM92], [KS99]) and the finite volume method (see [Kro98],
[Hir88] and [LeV02]).
Exercise 8.1 Verify that matrix (8.10) is positive definite.

Exercise 8.2 Verify that the eigenvalues of the matrix (8.10) are
\[ \lambda_j = 2(1 - \cos(j\theta)), \quad j = 1, \ldots, N, \]
while the corresponding eigenvectors are
\[ \mathbf{q}_j = (\sin(j\theta), \sin(2j\theta), \ldots, \sin(Nj\theta))^T, \]
where $\theta = \pi/(N+1)$. Deduce that $K(A)$ is proportional to $h^{-2}$.
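Before attempting the proof, the statement of Exercise 8.2 can be checked numerically. The sketch below assumes that the matrix is the unscaled tridiagonal matrix with 2 on the diagonal and -1 on the two adjacent diagonals, and that $h = 1/(N+1)$; both are assumptions made for illustration and should be compared with (8.10).

```matlab
% Assumption: A = tridiag(-1,2,-1) of order N, without any factor 1/h^2
N = 10;  theta = pi/(N+1);
e = ones(N-1,1);
A = 2*eye(N) - diag(e,1) - diag(e,-1);
lambda = 2*(1 - cos((1:N)'*theta));          % claimed eigenvalues (increasing in j)
maxdiff = max(abs(sort(eig(A)) - lambda));   % roundoff-sized if the formula holds
j = 3;                                       % check one eigenpair A*q = lambda_j*q
q = sin((1:N)'*j*theta);
resid = norm(A*q - lambda(j)*q);
ratio = cond(A)/(N+1)^2;                     % approaches 4/pi^2 as N grows
```

Since `cond(A)` behaves like a constant times $(N+1)^2$, the condition number grows like $h^{-2}$ under the assumption $h = 1/(N+1)$.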

Exercise 8.3 Prove that the quantity (8.7) provides a second order approxi- 
mation of u"(x) with respect to h. 

Exercise 8.4 Compute the matrix and the right hand side of the numerical 
scheme that we have proposed to approximate problem (8.11). 

Exercise 8.5 Use the finite difference method to approximate the boundary-value
problem

\[ -u''(x) + \frac{k}{T}\, u(x) = \frac{w(x)}{T} \ \text{ in } (0,1), \qquad u(0) = u(1) = 0, \]

where $u = u(x)$ represents the vertical displacement of a string of length 1,
subject to a transverse load of intensity $w$ per unit length. $T$ is the tension and
$k$ is the elastic coefficient of the string. For the case in which $w = 1 + \sin(4\pi x)$,
$T = 1$ and $k = 0.1$, compute the solution corresponding to $h = 1/i$, $i = 10, 20, 40$,
and deduce the order of accuracy of the method.

Exercise 8.6 We consider problem (8.11) on the interval $(0,1)$ with $\gamma = 0$,
$f = 0$, $\alpha = 0$ and $\beta = 1$. Using Program 17 find the maximum value $h_{crit}$
of $h$ for which the numerical solution is monotone (as is the exact solution)
when $\delta = 100$. What happens if $\delta = 1000$? Suggest an empirical formula for
$h_{crit}(\delta)$ as a function of $\delta$, and verify it for several values of $\delta$.

Exercise 8.7 Use the finite difference method to solve problem (8.11) in the
case where the following Neumann boundary conditions are prescribed at the
endpoints

\[ u'(a) = \alpha, \quad u'(b) = \beta. \]

Use the formulae given in (4.10) to discretize $u'(a)$ and $u'(b)$.
Exercise 8.8 Verify that, when using a uniform grid, the right hand side
of the system associated with the centered finite difference scheme coincides
with that of the finite element scheme provided that the composite trapezoidal
formula is used to compute the integrals on the elements $I_{k-1}$ and $I_k$.

Exercise 8.9 Verify that $\mathrm{div}\nabla\phi = \Delta\phi$, where $\nabla$ is the gradient operator
that associates to a function $u$ the vector whose components are the first
order partial derivatives of $u$.

Exercise 8.10 Consider a square plate whose side is 20 cm and whose thermal
conductivity is $k = 0.2$ cal/(sec·cm·C). Denote by $Q = 5$ cal/(cm$^3$·sec) the heat
production rate per unit area. The temperature $T = T(x,y)$ of the plate
satisfies the equation $-\Delta T = Q/k$. Assuming that $T$ is null on three sides of the
plate and is equal to 1 on the fourth side, determine the temperature $T$ at the
center of the plate.




9. Solutions of the exercises 



9.1 Chapter 1 



Solution 1.1 Only the numbers of the form $\pm 0.1a_2 \cdot 2^e$ with $a_2 = 0, 1$ and
$e = \pm 2, \pm 1, 0$ belong to the set $\mathbb{F}(2,2,-2,2)$. For a given exponent, we can
represent in this set only the two numbers 0.10 and 0.11, and their opposites.
Consequently, the number of elements belonging to $\mathbb{F}(2,2,-2,2)$ is 20. Finally,
$\epsilon_M = 1/2$.

Solution 1.2 For any fixed exponent, each of the digits $a_2, \ldots, a_t$ can assume
$\beta$ different values, while $a_1$ can assume only $\beta - 1$ values. Therefore
$2(\beta-1)\beta^{t-1}$ different numbers can be represented (the 2 accounts for the
positive and negative sign). On the other hand, the exponent can assume
$U - L + 1$ values. Thus, the set $\mathbb{F}(\beta,t,L,U)$ contains
$2(\beta-1)\beta^{t-1}(U-L+1)$ different elements.
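The counting argument of Solutions 1.1 and 1.2 can be checked by brute force for the small set $\mathbb{F}(2,2,-2,2)$. The script below is a sketch written for this illustration; the variable names are ours.

```matlab
% Enumerate all numbers +-(0.1 a2)_2 * 2^e of F(2,2,-2,2) and count them
beta = 2;  t = 2;  L = -2;  U = 2;
vals = [];
for e = L:U
    for a2 = 0:beta-1
        mant = 1/beta + a2/beta^2;             % mantissa (0.1 a2) in base 2
        vals = [vals, mant*beta^e, -mant*beta^e];
    end
end
count   = numel(unique(vals));                  % number of distinct elements
formula = 2*(beta-1)*beta^(t-1)*(U-L+1);        % count given in Solution 1.2
```

Both `count` and `formula` are equal to 20, in agreement with Solution 1.1.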

Solution 1.3 Use the instruction U=2*eye(10)-3*diag(ones(8,1),2) (respectively,
L=2*eye(10)-3*diag(ones(8,1),-2)).

Solution 1.4 We can interchange the third and seventh rows of the previous
matrix using the instructions: r=[1:10]; r(3)=7; r(7)=3; L(r,:). Notice
that the character : in L(r,:) ensures that all columns of L are spanned in the
usual increasing order (from the first to the last). To interchange the fourth
column with the eighth column we can write c=[1:10]; c(8)=4; c(4)=8;
L(:,c). Similar instructions can be used for the upper triangular matrix.

Solution 1.5 We can define the matrix A = [v1;v2;v3;v4] where v1, v2,
v3 and v4 are the 4 given row vectors. They are linearly independent iff the
determinant of A is different from 0, which is not true in our case.

Solution 1.6 The two given functions $f$ and $g$ have the symbolic expression:

>> syms x
>> f=sqrt(x^2+1); pretty(f)
(x^2 + 1)^(1/2)
>> g=sin(x^3)+cosh(x); pretty(g)
sin(x^3) + cosh(x)

The command syms x allows one to use x as a symbolic variable. The command
pretty(f) prints the symbolic expression f in a format that resembles
type-set mathematics. At this stage, the symbolic expression of the first and
second derivatives and the integral of $f$ can be obtained with the following
instructions:

>> diff(f,x)
ans =
1/(x^2+1)^(1/2)*x
>> diff(f,x,2)
ans =
-1/(x^2+1)^(3/2)*x^2+1/(x^2+1)^(1/2)
>> int(f,x)
ans =
1/2*x*(x^2+1)^(1/2)+1/2*asinh(x)

Similar instructions can be used for the function g.

Solution 1.7 The accuracy of the computed roots downgrades as the polyno- 
mial degree increases. This experiment reveals that the accurate computation 
of the roots of a polynomial of high degree can be troublesome. 

Solution 1.8 Here is a possible program to compute the sequence:

function I=sequence(n)
I = zeros(n+2,1); I(1) = (exp(1)-1)/exp(1);
for i = 0:n, I(i+2) = 1 - (i+1)*I(i+1); end

The sequence computed by this program does not tend to zero as n
increases; instead, it diverges with alternating sign.

Solution 1.9 The anomalous behavior of the computed sequence is due to
the propagation of roundoff errors from the innermost operation. In particular,
when $4^{1-n} z_n^2$ is less than $\epsilon_M$, the elements of the sequence are equal to 0. This
happens from $n = 27$.

Solution 1.10 The proposed method is a special instance of a Monte Carlo
method and is implemented by the following program:

function mypi=pimontecarlo(n)
x = rand(n,1); y = rand(n,1);
z = x.^2+y.^2;
v = (z <= 1);
m=sum(v); mypi=4*m/n;

The command rand generates a sequence of pseudo-random numbers. The
instruction v = (z <= 1) is a shorthand version of the following procedure: we
check whether z(k) <= 1 for each component of the vector z. If the inequality
is satisfied for the k-th component of z (that is, the point (x(k),y(k)) belongs
to the interior of the unit circle) v(k) is set equal to 1, and to 0 otherwise.
The command sum(v) computes the sum of all components of v, that is, the
number of points falling in the interior of the unit circle.

By launching the program as mypi=pimontecarlo(n) for different values
of n, we see that the approximation mypi of $\pi$ becomes more accurate as n
increases. For instance, for n=1000 we obtain mypi=3.1120, whilst for n=300000
we have mypi=3.1406.



Solution 1.11 The binomial coefficient can be computed by the following
program (see also the MATLAB function nchoosek):

function bc=bincoeff(n,k)
k = fix(k); n = fix(n);
if k > n, disp('k must be between 0 and n'); break; end
if k > n/2, k = n-k; end
if k <= 1, bc = n^k; else
num = (n-k+1):n; den = 1:k; el = num./den; bc = prod(el);
end

The command fix(k) rounds k toward zero (for nonnegative k, to the largest
integer not exceeding k). The command disp(string) displays the string, without
printing its name. In general, the command break terminates the execution of
for and while loops. If break is executed in an if, it terminates the statement
at that point. Finally, prod(el) computes the product of all elements of the
vector el.
9.2 Chapter 2

Solution 2.1 The command fplot allows us to study the graph of the given
function $f$ for various values of $\gamma$. For $\gamma = 1$, the corresponding function does
not have real zeros. For $\gamma = 2$, there is only one zero, $\alpha = 0$, with multiplicity
equal to 4 (that is, $f(\alpha) = f'(\alpha) = f''(\alpha) = f'''(\alpha) = 0$, while $f^{(4)}(\alpha) \ne 0$).
Finally, for $\gamma = 3$, $f$ has two distinct zeros, one in the interval $(-3,-1)$ and
the other one in $(1,3)$. In the case $\gamma = 2$, the bisection method cannot be
used since it is impossible to find an interval $(a,b)$ in which $f(a)f(b) < 0$.
For $\gamma = 3$, starting from the interval $[a,b] = [-3,-1]$, the bisection method
(Program 1) converges in 34 iterations to the value $\alpha = -1.85792082914850$
(with $f(\alpha) \simeq -3.6 \cdot 10^{-12}$), using the following instructions:

>> f=inline('cosh(x)+cos(x)-3'); a=-3; b=-1; tol=1.e-10; nmax=200;
>> [zero,res,niter]=bisection(f,a,b,tol,nmax)
zero =
-1.8579
res =
-3.6872e-12
niter =
34

Similarly, for $\gamma = 4$ the bisection method converges after 34 iterations to the
value $\alpha = -2.20562163688010$, with $f(\alpha) \simeq -1.8 \cdot 10^{-10}$.
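For readers who do not have Program 1 at hand, the computation for $\gamma = 3$ can be reproduced with a few lines. The loop below is a minimal sketch of the bisection idea, not the book's bisection program; in particular the stopping test on the half-length of the interval is our assumption.

```matlab
% Sketch of a bisection loop for f(x) = cosh(x) + cos(x) - 3 on [-3,-1]
f = @(x) cosh(x) + cos(x) - 3;
a = -3;  b = -1;  tol = 1.e-10;  nmax = 200;
fa = f(a);
niter = 0;
while (b - a)/2 >= tol && niter < nmax
    niter = niter + 1;
    x = (a + b)/2;  fx = f(x);
    if fa*fx <= 0
        b = x;                 % a sign change occurs in [a, x]
    else
        a = x;  fa = fx;       % a sign change occurs in [x, b]
    end
end
zero = (a + b)/2;              % midpoint of the last interval
res  = f(zero);
```

With this stopping test the loop also performs 34 halvings, and `zero` agrees with the value quoted above to about ten digits.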

Solution 2.2 We have to compute the zeros of the function $f(V) = pV +
aN^2/V - abN^3/V^2 - pNb - kNT$. Plotting the graph of $f$, we see that this
function has just a simple zero in the interval $(0.01, 0.06)$ with $f(0.01) < 0$ and
$f(0.06) > 0$. We can compute this zero using the bisection method as follows:

>> f=inline('35000000*V+401000/V-17122.7/V^2-1494500','V');
>> [zero,res,niter]=bisection(f,0.01,0.06,1.e-12,100)
zero =
0.0427
res =
-6.3814e-05
niter =
35

Solution 2.3 The unknown value of $\omega$ is the zero of the function $f(\omega) =
s(1,\omega) - 1 = 9.8/2\,[\sinh(\omega) - \sin(\omega)]/\omega^2 - 1$. From the graph of $f$ we conclude
that $f$ has a unique real zero in the interval $(0.5, 1)$. Starting from this interval,
the bisection method computes the value $\omega = 0.61214447021484$ with the
desired tolerance in 15 iterations as follows:

>> f=inline('9.8/2*(sinh(omega)-sin(omega))./omega.^2-1','omega');
>> [zero,res,niter]=bisection(f,0.5,1,1.e-05,100)
zero =
6.1214e-01
res =
3.1051e-06
niter =
15

The Newton method with initial value 0.5 would require only 3 iterations
to provide the same significant digits, with a much smaller residual:

>> fx=inline(['(49/10*cosh(omega)-49/10*cos(omega))./omega.^2-2*',...
'(49/10*sinh(omega)-49/10*sin(omega))./omega.^3'],'omega');
>> [zero,res,niter]=newton(f,fx,0.5,1.e-05,100)
zero =
6.1214e-01
res =
4.4409e-16
niter =
3
Solution 2.4 The inequality (2.3) can be derived by observing that $|e^{(k)}| <
|I^{(k)}|/2$ with $|I^{(k)}| \le 2^{-k}(b - a)$. Consequently, the error at the
iteration $k_{min}$ is less than $\varepsilon$ if $k_{min}$ is such that $2^{-k_{min}-1}(b - a) < \varepsilon$, that is,
$2^{-k_{min}-1} < \varepsilon/(b - a)$, which proves (2.3).
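Taking logarithms in the last inequality gives $k_{min} > \log_2((b-a)/\varepsilon) - 1$. As a quick check, the lines below evaluate the smallest such integer for the intervals and tolerances used in Solutions 2.1, 2.2 and 2.3; the helper name `kmin` is introduced here for illustration only.

```matlab
% Smallest integer k with 2^(-k-1)*(b-a) < tol, i.e. k > log2((b-a)/tol) - 1
kmin = @(a,b,tol) floor(log2((b-a)/tol) - 1) + 1;
k21 = kmin(-3,   -1,   1.e-10);   % data of Solution 2.1
k22 = kmin(0.01, 0.06, 1.e-12);   % data of Solution 2.2
k23 = kmin(0.5,  1,    1.e-05);   % data of Solution 2.3
```

The three values are 34, 35 and 15, which coincide with the iteration counts reported in those solutions.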

Solution 2.5 The first formula is less sensitive to the roundoff error. 



Solution 2.6 In Solution 2.1 we have analyzed the zeros of the given function
with respect to different values of $\gamma$. Let us consider the case when $\gamma = 2$.
Starting from the initial guess $x^{(0)} = 1$, the Newton method (Program 2)
converges to the value $\alpha = 1.0113\mathrm{e}{-04}$ in 31 iterations with tol=1.e-10
while the exact zero of $f$ is equal to 0. This big discrepancy is due to the
fact that $f$ is almost a constant in a neighborhood of its zero. Actually, the
corresponding residual computed by MATLAB is 0. Let us set now $\gamma = 3$. The
Newton method with tol=1.e-16 converges to the value 1.85792082915020 in
10 iterations, while for $\gamma = 4$ it converges to 2.20562163692958 in 12 iterations
(in both cases the residuals are zero in MATLAB).



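Solution 2.7 The square and the cube roots of a number $a$ are the solutions
of the equations $x^2 = a$ and $x^3 = a$, respectively. Thus, the corresponding
algorithms are: for a given $x^{(0)}$ compute

\[ x^{(k+1)} = \frac{1}{2}\left( x^{(k)} + \frac{a}{x^{(k)}} \right), \quad k \ge 0 \quad \text{for the square root}, \]

\[ x^{(k+1)} = \frac{1}{3}\left( 2x^{(k)} + \frac{a}{(x^{(k)})^2} \right), \quad k \ge 0 \quad \text{for the cube root}. \]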
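Both recurrences follow from the Newton step $x - f(x)/f'(x)$ applied to $f(x) = x^2 - a$ and $f(x) = x^3 - a$. A short sketch of the two iterations is given below; the test values $a = 2$ and $a = 27$, the initial guess and the stopping tolerance are arbitrary choices made for this illustration.

```matlab
a = 2;  x = 1;                         % square root of 2
for k = 1:20
    xnew = (x + a/x)/2;                % Newton step for x^2 - a = 0
    if abs(xnew - x) < 1.e-14, x = xnew; break; end
    x = xnew;
end
sq = x;

a = 27;  x = 1;                        % cube root of 27
for k = 1:50
    xnew = (2*x + a/x^2)/3;            % Newton step for x^3 - a = 0
    if abs(xnew - x) < 1.e-14, x = xnew; break; end
    x = xnew;
end
cb = x;
```

The results can be compared with `sqrt(2)` and with the exact value 3.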






Solution 2.8 Setting $\delta x^{(k)} = x^{(k)} - \alpha$, from the Taylor expansion of $f$ we
find:

\[ 0 = f(\alpha) = f(x^{(k)}) - \delta x^{(k)} f'(x^{(k)}) + \frac{1}{2}(\delta x^{(k)})^2 f''(x^{(k)}) + O((\delta x^{(k)})^3). \qquad (9.1) \]

The Newton method yields

\[ \delta x^{(k+1)} = \delta x^{(k)} - f(x^{(k)})/f'(x^{(k)}). \qquad (9.2) \]

Combining (9.1) with (9.2), we have

\[ \delta x^{(k+1)} = \frac{1}{2}(\delta x^{(k)})^2 \frac{f''(x^{(k)})}{f'(x^{(k)})} + O((\delta x^{(k)})^3). \]

After division by $(\delta x^{(k)})^2$ and letting $k \to \infty$ we prove the convergence result.

Solution 2.9 For certain values of $\beta$ the equation (2.2) can have two roots
that correspond to different configurations of the rods system. The two initial
values that are suggested have been chosen conveniently to allow the Newton
method to converge toward one or the other root, respectively. We solve the
problem corresponding to the following values of $\beta$: $\beta = 2k\pi/300$ with $k =
0, \ldots, 100$. We use the following instructions to obtain the solution of the
problem (shown in Figure 9.1):

>> a = [10 13 8 10];
>> f = inline(['a(1)/a(2)*cos(beta)-a(1)/a(4)*cos(alpha)-cos(beta-alpha)+',...
'(a(1)^2+a(2)^2-a(3)^2+a(4)^2)/(2*a(2)*a(4))'],'alpha','beta','a');
>> df = inline('a(1)/a(4)*sin(alpha)-sin(beta-alpha)','alpha','beta','a');
>> tol = 1.e-05; x0 = -0.1; x1 = 2*pi/3; nmax = 200;
>> alpha1 = []; alpha2 = []; niter1 = []; niter2 = [];
>> vbeta = linspace(0,2*pi/3,100);
>> for beta = vbeta
[zero,res,niter]=newton(f,df,x0,tol,nmax,beta,a);
alpha1 = [alpha1,zero]; niter1 = [niter1,niter];
[zero,res,niter]=newton(f,df,x1,tol,nmax,beta,a);
alpha2 = [alpha2,zero]; niter2 = [niter2,niter];
end

The components of the vectors alpha1 and alpha2 are the angles computed for
different values of $\beta$, while the components of the vectors niter1 and niter2
are the number of Newton iterations (5 or 6) necessary to compute the zeros
with the requested tolerance.




Fig. 9.1. The two curves represent the two possible configurations which
correspond to every value $\beta \in [0, 2\pi/3]$



Solution 2.10 From an inspection of its graph we see that $f$ has two positive
real zeros ($\alpha_2 \simeq 1.5$ and $\alpha_3 \simeq 2.5$) and one negative ($\alpha_1 \simeq -0.5$). The Newton
method converges in 4 iterations (having set $x^{(0)} = -0.5$ and tol = 1.e-10)
to the value $\alpha_1$:

>> f=inline('exp(x)-2*x^2'); df=inline('exp(x)-4*x'); x0=-0.5; tol=1.e-10;
>> nmax=100; format long; [zero,res,niter]=newton(f,df,x0,tol,nmax)
zero =
-0.53983527690282
res =
0
niter =
4

The given function has a maximum at $\bar{x} \simeq 0.3574$ (which can be obtained
by applying the Newton method to the function $f'$): for $x^{(0)} < \bar{x}$ the method
converges to the negative zero. If $x^{(0)} = \bar{x}$ the Newton method cannot be
applied since $f'(\bar{x}) = 0$.

Solution 2.11 Let us set $x^{(0)} = 0$ and tol $= 10^{-17}$. The Newton method
converges in 32 iterations to the value 0.64118239763649, which we identify
with the exact zero $\alpha$. We can observe that the (approximate) errors $x^{(k)} - \alpha$,
for $k = 1, 2, \ldots, 31$, decrease only linearly when $k$ increases. This behavior is
due to the fact that $\alpha$ has multiplicity greater than 1 (see Figure 9.2). To
recover a second-order method we can use the modified Newton method.

Fig. 9.2. Error vs iteration number of the Newton method for the computation
of the zero of the function $f(x) = x^3 - 3x^2 2^{-x} + 3x 4^{-x} - 8^{-x}$



Solution 2.12 We should compute the zero of the function $f(x) = \sin(x) -
\sqrt{2gh/v_0^2}$. From an inspection of its graph, we can conclude that $f$ has one zero
in the interval $(0, \pi/2)$. The Newton method with $x^{(0)} = \pi/4$ and tol $= 10^{-10}$
converges in 5 iterations to the value 0.45862863227859.

Solution 2.13 Using the data given in the exercise, the solution can be
obtained with the following instructions:

>> f=inline('M-v*(1+I).*((1+I).^n-1)./I','I','M','v','n');
>> df=inline('-v*((1+I).^n-1)./I-n*v*(1+I).^n./I+v*(1+I).*((1+I).^n-1)./I.^2',...
'I','M','v','n');
>> [zero,res,niter]=bisection(f,0.01,0.1,1.e-12,4,6000,1000,5);
>> [zero,res,niter]=newton(f,df,zero,1.e-12,100,6000,1000,5);

The Newton method converges to the desired result in 3 iterations.

Solution 2.14 By a graphical study, we see that (2.15) is satisfied for a
value of $\alpha$ in $(\pi/6, \pi/4)$. The Newton method provides the approximate value
0.59627992746547 in 5 iterations, starting from $x^{(0)} = \pi/4$. We deduce that
the maximum length of a rod that can pass in the corridor is $L = 30.84$.

Solution 2.15 If $\alpha$ is a zero of $f$ with multiplicity $m$, then there exists a
function $h$ such that $h(\alpha) \ne 0$ and $f(x) = h(x)(x - \alpha)^m$. The desired result
can be obtained by writing both $f$ and its first derivative in the Newton method
in terms of $h$. We obtain for the iteration function the following expression

\[ \phi_N(x) = x - \frac{h(x)(x-\alpha)^m}{h'(x)(x-\alpha)^m + m h(x)(x-\alpha)^{m-1}} = x - \frac{h(x)(x-\alpha)}{h'(x)(x-\alpha) + m h(x)}. \]

By computing its first derivative, we have:

\[ \phi_N'(x) = 1 - \frac{[h'(x)(x-\alpha) + h(x)]D(x) - h(x)(x-\alpha)D'(x)}{D^2(x)}, \]

where $D(x) = h'(x)(x - \alpha) + m h(x)$. Therefore $\phi_N'(\alpha) = 1 - 1/m$ and thus the
Newton method converges quadratically only if $m = 1$.
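The value $\phi_N'(\alpha) = 1 - 1/m$ can be observed numerically on the function of Figure 9.2, which factors as $f(x) = (x - 2^{-x})^3$ and therefore has a zero of multiplicity $m = 3$. The sketch below makes several assumptions for the sake of illustration: it uses the factored form of $f$ (to limit cancellation errors), a stopping test on the increment, and plain loops instead of the book's newton program, so iteration counts need not match those quoted in the solutions.

```matlab
g  = @(x) x - 2.^(-x);               % f = g.^3, factored form
dg = @(x) 1 + log(2)*2.^(-x);
f  = @(x) g(x).^3;
df = @(x) 3*g(x).^2.*dg(x);
tol = 1.e-10;  nmax = 200;  m = 3;

x = 0;  xs = x;                      % standard Newton, storing all iterates
for k = 1:nmax
    d = df(x);  if d == 0, break; end
    dx = f(x)/d;  x = x - dx;  xs(end+1) = x;
    if abs(dx) < tol, break; end
end

xm = 0;                              % modified Newton: step multiplied by m
for km = 1:nmax
    d = df(xm);  if d == 0, break; end
    dx = m*f(xm)/d;  xm = xm - dx;
    if abs(dx) < tol, break; end
end

err = abs(xs - xm);                  % xm is used as reference value of the zero
ratio = err(21)/err(20);             % ratio of successive errors
```

The ratio of successive errors of the standard method settles near $2/3 = 1 - 1/m$, while the modified method reaches the same zero in a handful of iterations.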



Solution 2.16 Let us inspect the graph of $f$ by using the following commands:

>> f='x^3+4*x^2-10'; fplot(f,[-10,10]); grid on;
>> fplot(f,[-5,5]); grid on;
>> fplot(f,[0,5]); grid on

We can see that $f$ has only one real zero, equal approximately to 1.36 (see
Figure 9.3). The iteration function and its derivative are:

\[ \phi(x) = \frac{2x^3 + 4x^2 + 10}{3x^2 + 8x}, \qquad \phi'(x) = \frac{(6x^2 + 8x)(3x^2 + 8x) - (6x + 8)(2x^3 + 4x^2 + 10)}{(3x^2 + 8x)^2}, \]

and $\phi(\alpha) = \alpha$. We easily deduce that $\phi'(\alpha) = 0$ by noting that $\phi'(x) =
(6x + 8)f(x)/(3x^2 + 8x)^2$. Consequently, the proposed method converges (at
least) quadratically.

Solution 2.17 The proposed method is convergent at least with order 2 since
$\phi'(\alpha) = 0$.

Solution 2.18 By keeping the remaining parameters unchanged, the method
converges after only 3 iterations to the value 0.64118573649623 which differs
by less than $10^{-9}$ from the result previously computed. However, the behavior
of the function, which is quite flat near to $x = 0$, suggests that the result
computed previously could be more accurate. In Figure 9.4 we show the graph
of $f$ in $(0.5, 0.7)$, obtained with the following instructions:

>> f='x^3-3*x^2*2^(-x) + 3*x*4^(-x) - 8^(-x)';
>> fplot(f,[0.5 0.7]); grid on

Fig. 9.3. Graph of $f(x) = x^3 + 4x^2 - 10$ for $x \in [0, 2]$

Fig. 9.4. Graph of $f(x) = x^3 - 3x^2 2^{-x} + 3x 4^{-x} - 8^{-x}$ for $x \in [0.5, 0.7]$



9.3 Chapter 3 



Solution 3.1 Since $x \in (x_0, x_n)$, there exists an interval $I_i = (x_{i-1}, x_i)$ such
that $x \in I_i$. We can easily see that $\max_{x \in I_i} |(x - x_{i-1})(x - x_i)| = h^2/4$. If
we bound $|x - x_{i+1}|$ above by $2h$, $|x - x_{i-2}|$ by $3h$ and so on, we obtain the
inequality (3.6).

Solution 3.2 In all cases we have $n = 4$ and thus we should estimate the fifth
derivative of each function in the given interval. We find: $\max_{x \in [-1,1]} |f_1^{(5)}| <
1.76$, $\max_{x \in [-1,1]} |f_2^{(5)}| < 1.55$, $\max_{x \in [-\pi/2,\pi/2]} |f_3^{(5)}| < 1.42$. The
corresponding errors are therefore bounded by 0.0028, 0.0024 and 0.0212, respectively.

Solution 3.3 Using the command polyfit we compute the interpolating
polynomials of degree 3 in the two cases:

>> years=[1975 1980 1985 1990];
>> east=[70.2 70.2 70.3 71.2];
>> west=[72.8 74.2 75.2 76.4];
>> ceast=polyfit(years,east,3);
>> cwest=polyfit(years,west,3);
>> esteast=polyval(ceast,[1970 1983 1988 1995])
esteast =
69.6000 70.2032 70.6992 73.6000
>> estwest=polyval(cwest,[1970 1983 1988 1995])
estwest =
70.4000 74.8096 75.8576 78.4000

Thus, for Western Europe the life expectancy in the year 1970 is estimated as
70.4 years (estwest(1)), with a discrepancy of 1.4 years from the real value.
The symmetry of the graph of the interpolating polynomial suggests that the
estimate of 78.4 years for the life expectancy in the year 1995 may be an
overestimate by the same quantity (in fact, the real life expectancy is equal
to 77.5 years). A different conclusion holds concerning Eastern Europe. Indeed,
in that case the estimate for 1970 coincides exactly with the real value, while
the estimate for 1995 is far too large (73.6 years instead of 71.2).
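Fitting a cubic directly in the variable `years` involves powers of numbers close to 2000, for which polyfit may warn about bad conditioning. A common remedy, shown here as a sketch and not taken from the book, is to interpolate in a shifted and scaled variable; the choice $t = (\text{year} - 1975)/5$ is ours. The interpolating polynomial is the same, so the estimates are unchanged.

```matlab
years = [1975 1980 1985 1990];
east  = [70.2 70.2 70.3 71.2];
west  = [72.8 74.2 75.2 76.4];
tt = @(y) (y - 1975)/5;                  % shifted and scaled time variable
ceast = polyfit(tt(years), east, 3);     % cubic through the 4 data points
cwest = polyfit(tt(years), west, 3);
ynew  = [1970 1983 1988 1995];
esteast = polyval(ceast, tt(ynew));
estwest = polyval(cwest, tt(ynew));
```

Up to roundoff, `esteast` and `estwest` reproduce the values listed in Solution 3.3.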

Solution 3.4 We chose the month as time-unit. The initial time $t_0 = 1$
corresponds to November 1987, while $t_7 = 157$ to November 2000. With the
following instructions we compute the coefficients of the polynomial interpolating
the given prices:

>> time = [1 14 37 63 87 99 109 157];
>> price = [4.5 5 6 6.5 7 7.5 8 8];
>> [c] = polyfit(time,price,7);

Setting price2002 = polyval(c,180) we find that the estimated price of the
magazine in November 2002 is approximately 10.8 euros.

Solution 3.5 The interpolatory cubic spline, computed by the command spline
in this special case, coincides with the interpolating polynomial. This wouldn't
be true for the natural interpolating cubic spline.

Solution 3.6 We use the following instructions:

>> T = [4:4:20];
>> rho=[1000.7794,1000.6427,1000.2805,999.7165,998.9700];
>> Tnew = [6:4:18]; format long e;
>> rhonew = spline(T,rho,Tnew)
rhonew =
Columns 1 through 3
1.000740787500000e+03 1.000488237500000e+03 1.000022450000000e+03
Column 4
9.993649250000000e+02

The comparison with the further measures shows that the approximation is
extremely accurate. Note that the state equation for the sea-water (UNESCO,
1980) assumes a fourth-order dependence of the density on the temperature.
However, the coefficient of the fourth power of $T$ is of order of $10^{-9}$.



Solution 3.7 To compute the interpolatory cubic spline we use the following
instructions:

>> year=[1965 1970 1980 1985 1990 1991];
>> production=[17769 24001 25961 34336 29036 33417];
>> years=[1962 1977 1992];
>> estimateprod=spline(year,production,years);

We obtain the following values: $5146.1 \times 10^5$ Kg in 1962; $22641.7 \times 10^5$ Kg in
1977; $41894.4 \times 10^5$ Kg in 1992. The comparison with the real data (12380,
27403 and 32059 $\times 10^5$ Kg, respectively) shows that the values predicted by
the spline are inaccurate outside the interpolation interval (see Figure 9.5).
Nonetheless, the control exerted by the spline at the end-points of the
interpolation interval allows us to obtain reasonable values also for 1965. On the
contrary, the interpolating polynomial introduces large oscillations near this
end-point and underestimates the production by as much as $-77685 \times 10^5$ Kg
for 1962.

Fig. 9.5. The cubic spline (continuous line) and interpolating polynomial
(dotted line) for the data of Exercise 3.7. The circles denote the values used
in the interpolation

Solution 3.8 The interpolating polynomial p and the spline s3 at 21
equispaced nodes in $[-1,1]$ can be evaluated in 201 equispaced nodes by the
following instructions:

>> pert = 1.e-04;
>> x=[-1:2/20:1]; y=sin(2*pi*x)+(-1).^[1:21]*pert; z=[-1:0.01:1];
>> c=polyfit(x,y,20); p=polyval(c,z); s3=spline(x,y,z);

When we use the unperturbed data (pert=0) the graphs of both p and s3
are indistinguishable from that of the given function. The situation changes
dramatically when the perturbed data are used (pert=1.e-04). In particular,
the interpolating polynomial shows strong oscillations at the end-points of the
interval, whereas the spline remains practically unchanged (see Figure 9.6).
This example shows that approximation by splines is in general more stable
with respect to perturbation errors.

Solution 3.9 If $n = m$, setting $\tilde{f} = \Pi_n f$ we find that the first member of
(3.18) is null. Thus in this case $\Pi_n f$ is the unique solution of the least-squares
problem.

Fig. 9.6. The interpolating polynomial (dotted line) and the interpolatory
cubic spline (continuous line) corresponding to the perturbed data. Note the
severe oscillations of the interpolating polynomial near the end-points of the
interval



Solution 3.10 The coefficients (obtained by the command polyfit) of the
requested polynomials are (only the first 4 significant digits are shown):

$K = 0.67$: $a_4 = 6.301 \cdot 10^{-8}$, $a_3 = -8.320 \cdot 10^{-8}$, $a_2 = -2.850 \cdot 10^{-4}$, $a_1 = 9.718 \cdot 10^{-4}$, $a_0 = -3.032$;

$K = 1.5$: $a_4 = -4.225 \cdot 10^{-8}$, $a_3 = -2.066 \cdot 10^{-6}$, $a_2 = 3.444 \cdot 10^{-4}$, $a_1 = 3.364 \cdot 10^{-3}$, $a_0 = 3.364$;

$K = 2$: $a_4 = -1.012 \cdot 10^{-7}$, $a_3 = -1.431 \cdot 10^{-7}$, $a_2 = 6.988 \cdot 10^{-4}$, $a_1 = -1.060 \cdot 10^{-4}$, $a_0 = 4.927$;

$K = 3$: $a_4 = -2.323 \cdot 10^{-7}$, $a_3 = 7.980 \cdot 10^{-7}$, $a_2 = 1.420 \cdot 10^{-3}$, $a_1 = -2.605 \cdot 10^{-3}$, $a_0 = 7.315$.

In Figure 9.7 we show the graph of the polynomial computed using the data 
in the first column of Table 3.1. 




Fig. 9.7. Least-square polynomial of degree 4 (continuous line) compared with 
the data in the first column of Table 3.1 



Solution 3.11 By repeating the first 3 instructions reported in Solution 3.7 
and using the command polyfit, we find the following values (in 10 5 Kg): 
15280.12 in 1962; 27407.10 in 1977; 32019.01 in 1992, which represent good 
approximations to the real ones (12380, 27403 and 32059, respectively). 
Solution 3.12 We can rewrite the coefficients of the system (3.20) in terms
of mean and variance by noting that the variance can be expressed as
$v = \frac{1}{n+1} \sum_{i=0}^{n} x_i^2 - M^2$.
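For completeness, the identity used in Solution 3.12 follows in one line, assuming that the mean is $M = \frac{1}{n+1}\sum_{i=0}^{n} x_i$ and that the variance is defined with the same normalization $1/(n+1)$:

```latex
v = \frac{1}{n+1}\sum_{i=0}^{n} (x_i - M)^2
  = \frac{1}{n+1}\sum_{i=0}^{n} x_i^2 - 2M\,\frac{1}{n+1}\sum_{i=0}^{n} x_i + M^2
  = \frac{1}{n+1}\sum_{i=0}^{n} x_i^2 - M^2 .
```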



Solution 3.13 The desired property is deduced from the first equation of the 
system that provides the coefficients of the least-squares straight line. 



9.4 Chapter 4 



Solution 4.1 Using the following third-order Taylor expansions of $f$ at the
point $x_0$, we obtain

\[ 4f(x_1) = 4f(x_0) + 4hf'(x_0) + 2h^2 f''(x_0) + \frac{2}{3}h^3 f'''(\xi), \]

\[ -f(x_2) = -f(x_0) - 2hf'(x_0) - 2h^2 f''(x_0) - \frac{4}{3}h^3 f'''(\eta), \]

with $\xi \in (x_0, x_1)$ and $\eta \in (x_0, x_2)$ as two suitable points. Summing these two
expressions yields

\[ -3f(x_0) + 4f(x_1) - f(x_2) = 2hf'(x_0) + \frac{2h^3}{3}\left( f'''(\xi) - 2f'''(\eta) \right). \]

Dividing by $2h$ and using the mean value theorem, we can conclude the exercise
with $\xi_0 \in (x_0, x_2)$. A similar procedure can be used for the formula at $x_n$.
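The second-order accuracy of the one-sided formula obtained by dividing the last identity by $2h$ can be observed numerically. The test function $f = \exp$, the point $x_0 = 0$ and the two step sizes below are arbitrary choices made for this sketch.

```matlab
f = @(x) exp(x);  x0 = 0;  dfex = 1;              % f'(0) = 1 for f = exp
h = [1.e-2, 5.e-3];                               % two step sizes, h and h/2
d = (-3*f(x0) + 4*f(x0+h) - f(x0+2*h))./(2*h);    % one-sided approximation of f'(x0)
err = abs(d - dfex);
p = log(err(1)/err(2))/log(2);                    % observed order
```

Halving $h$ reduces the error by a factor close to 4, so `p` is close to 2.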

Solution 4.2 Taylor expansions yield

\[ f(\bar{x} + h) = f(\bar{x}) + hf'(\bar{x}) + \frac{h^2}{2} f''(\bar{x}) + \frac{h^3}{6} f'''(\xi), \]

\[ f(\bar{x} - h) = f(\bar{x}) - hf'(\bar{x}) + \frac{h^2}{2} f''(\bar{x}) - \frac{h^3}{6} f'''(\eta), \]

where $\xi$ and $\eta$ are suitable points. Subtracting these two expressions and
dividing by $2h$ we obtain the result (4.9).

Solution 4.3 Assuming that $f \in C^4$ and proceeding as in Solution 4.2 we
obtain the following errors (for suitable points $\xi_1$, $\xi_2$ and $\xi_3$):

«• - b. - C. ±fW(£ 3 )h\ 

Solution 4.4 Using the approximation (4.8), we obtain the following values:

t (months)    0      0.5      1       1.5      2      2.5     3
$\delta n$           --    78.04    44.95   19.28    7.02   2.39    --
$n'$           --    77.78    39.09   15.19    5.29   1.77    --

By comparison with the exact values of $n'(t)$ we can conclude that the
computed values are sufficiently accurate.
Solution 4.5 The quadrature error can be bounded by

\[ \frac{(b - a)^3}{24M^2} \max_{x \in [a,b]} |f''(x)|, \]

where $[a, b]$ is the integration interval and $M$ the (unknown) number of
subintervals.

The function $f_1$ is infinitely differentiable. From the graph of $f_1''$ we infer
that $|f_1''(x)| < 2$ in the integration interval. Thus the integration error for $f_1$
is less than $10^{-4}$ provided that $5^3/(24M^2) \cdot 2 < 10^{-4}$, that is $M > 322$.

Also the function $f_2$ is differentiable to any order. Since $\max_{x \in [0,\pi]} |f_2''(x)| =
\sqrt{2}\,e^{3\pi/4}$, the integration error is less than $10^{-4}$ provided that $M > 439$. These
inequalities actually provide an overestimation of the integration errors.
Indeed, the (effective) minimum number of intervals which ensures that the error
is below the fixed tolerance of $10^{-4}$ is much lower than that predicted by our
result (for instance, for the function $f_1$ this number is 51). Finally, we note
that since $f_3$ is not differentiable in the integration interval, our theoretical
error estimate doesn't hold.
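The two thresholds for $M$ follow from solving the bound for $M$. The lines below repeat that arithmetic, using only the data quoted in Solution 4.5 (interval lengths 5 and $\pi$, bounds 2 and $\sqrt{2}\,e^{3\pi/4}$ on the second derivatives); the helper name `Mbound` is ours.

```matlab
tol = 1.e-4;
% (b-a)^3/(24*M^2)*maxd2 < tol   <=>   M > sqrt((b-a)^3*maxd2/(24*tol))
Mbound = @(len, maxd2) sqrt(len^3*maxd2/(24*tol));
M1 = Mbound(5,  2);                       % function f_1
M2 = Mbound(pi, sqrt(2)*exp(3*pi/4));     % function f_2
```

Here `M1` is about 322.7 and `M2` about 439.05, consistent with the conditions $M > 322$ and $M > 439$.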

Solution 4.6 On each interval $I_k$, $k = 1, \ldots, M$, the error committed is
equal to $H^3/24\, f''(\xi_k)$ with $\xi_k \in (x_{k-1}, x_k)$ and hence the global error will
be $H^3/24 \sum_{k=1}^{M} f''(\xi_k)$. Since $f''$ is a continuous function in $(a, b)$ there exists
a point $\xi \in (a, b)$ such that $f''(\xi) = \frac{1}{M} \sum_{k=1}^{M} f''(\xi_k)$. Using this result and the
fact that $MH = b - a$, we derive equation (4.13).

Solution 4.7 This effect is due to the accumulation of local errors on each 
sub-interval. 

Solution 4.8 By construction, the mid-point formula integrates the constants exactly. To verify that linear polynomials are also integrated exactly, it is sufficient to verify that $I(x) = I_{pm}(x)$. As a matter of fact we have

$$I(x) = \int_a^b x \, dx = \frac{b^2 - a^2}{2}, \qquad I_{pm}(x) = (b-a)\,\frac{b+a}{2}.$$



Solution 4.9 For the function $f_1$ we find $M = 71$ if we use the trapezoidal formula and only $M = 7$ for the Gauss formula. Indeed, the computational advantage of this latter formula is evident.

Solution 4.10 Equation (4.16) states that the quadrature error for the composite trapezoidal formula with $H = H_1$ is equal to $C H_1^2$, with $C = -\frac{b-a}{12} f''(\xi)$.

If $f''$ does not vary "too much", we can assume that also the error with $H = H_2$ behaves like $C H_2^2$. Then, by equating the two expressions

$$I(f) \simeq I_1 + C H_1^2, \qquad I(f) \simeq I_2 + C H_2^2, \qquad (9.3)$$

we obtain $C = (I_1 - I_2)/(H_2^2 - H_1^2)$. Using this value in one of the expressions (9.3), we obtain equation (4.29), that is, a better approximation than is produced by $I_1$ or $I_2$.
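
The extrapolation step can be sketched in a few lines. This is an illustration under stated assumptions: the integrand exp(-x^2/2) on [0,2] and its reference value are borrowed from Solution 4.13, while the helper name trap and the choices M1 = 8, M2 = 16 are hypothetical.

```matlab
% Richardson extrapolation of two composite trapezoidal approximations.
f = @(x) exp(-x.^2/2);  a = 0;  b = 2;     % illustrative integrand
% Composite trapezoidal formula on M equal subintervals
trap = @(M) (b-a)/M * ( (f(a)+f(b))/2 + sum(f(a + (1:M-1)*(b-a)/M)) );
M1 = 8;   M2 = 16;
H1 = (b-a)/M1;  H2 = (b-a)/M2;
I1 = trap(M1);  I2 = trap(M2);
C  = (I1 - I2)/(H2^2 - H1^2);   % from I1 + C*H1^2 = I2 + C*H2^2
IR = I1 + C*H1^2;               % extrapolated value
Iref = 1.19628801332261;        % reference value quoted in Solution 4.13
errs = abs([I1 I2 IR] - Iref)
```

The third entry of errs should be markedly smaller than the first two, since the leading H^2 error term has been cancelled.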



Solution 4.11 We seek the maximum positive integer $p$ such that $I_{approx}(x^p) = I(x^p)$. For $p = 0, 1, 2, 3$ we find the following non-linear system with 4 equations in the 4 unknowns $\alpha$, $\beta$, $x$ and $z$:

$$\begin{aligned}
p = 0: &\quad \alpha + \beta = b - a, \\
p = 1: &\quad \alpha x + \beta z = \frac{b^2 - a^2}{2}, \\
p = 2: &\quad \alpha x^2 + \beta z^2 = \frac{b^3 - a^3}{3}, \\
p = 3: &\quad \alpha x^3 + \beta z^3 = \frac{b^4 - a^4}{4}.
\end{aligned}$$

From the first two equations we can eliminate $\alpha$ and $z$ and reduce the system to a new one in the unknowns $\beta$ and $x$. In particular, we find a second-order equation in $\beta$ from which we can compute $\beta$ as a function of $x$. Finally, the non-linear equation in $x$ can be solved by the Newton method, yielding two values of $x$ that are the abscissae of the Gauss quadrature points.



Solution 4.12 Since

$$f_1^{(4)}(x) = \frac{24\,(2x - 2\pi)^4}{(1 + (x-\pi)^2)^5} - \frac{72\,(2x - 2\pi)^2}{(1 + (x-\pi)^2)^4} + \frac{24}{(1 + (x-\pi)^2)^3},$$

$$f_2^{(4)}(x) = -4 e^x \cos(x),$$

we find that the maximum of $|f_1^{(4)}(x)|$ is bounded by $M_1 \simeq 25$, while that of $|f_2^{(4)}(x)|$ by $M_2 \simeq 93$. Consequently, from (4.22) we obtain $H < 0.21$ in the first case and $H < 0.16$ in the second case.



Solution 4.13 Using the command int('exp(-x^2/2)',0,2) we obtain for the integral at hand the value 1.19628801332261.

The Gauss formula (4.20) applied to the same interval would provide the 
value 1.20278027622354 (with an absolute error equal to 6.4923e-03), while the 
Simpson formula gives 1.18715264069572 with a slightly larger error (equal to 
9.1354e-03). 

Solution 4.14 We note that $I_k > 0$ $\forall k$, since the integrand is non-negative. Therefore, we expect that all the values produced by the recursive formula should be non-negative. Unfortunately, the recursive formula is unstable to the propagation of roundoff errors and produces negative elements:

>> I(1)=1/exp(1); for k=2:20, I(k)=1-k*I(k-1); end
>> I(20)
ans =
  -30.1924

Using the composite Simpson formula, with H < 0.25, we can compute the 
integral with the desired accuracy. 
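
The contrast between the two approaches can be sketched as follows. The sketch assumes, consistently with the recursion and its starting value 1/e, that I_k = (1/e) * integral of x^k e^x over [0,1]; the number of subintervals M = 50 (so H = 0.02 < 0.25) is an illustrative choice.

```matlab
% Forward recursion versus composite Simpson for I_20.
I = zeros(1,20);  I(1) = 1/exp(1);
for k = 2:20
    I(k) = 1 - k*I(k-1);             % roundoff is multiplied by k at each step
end

k = 20;  M = 50;  H = 1/M;            % M subintervals of width H (assumption)
g  = @(x) x.^k .* exp(x) / exp(1);    % assumed integrand
xe = linspace(0,1,M+1);               % subinterval end points
xm = (xe(1:end-1) + xe(2:end))/2;     % subinterval midpoints
IS = H/6 * ( g(xe(1)) + 2*sum(g(xe(2:end-1))) + 4*sum(g(xm)) + g(xe(end)) );
[I(20), IS]
```

The first value is the unstable one reported above; the Simpson value is a sum of non-negative terms and therefore stays positive.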

Solution 4.15 For the Simpson formula we obtain

$I_1 = 1.19616568040561$, $I_2 = 1.19628173356793$, $\Rightarrow$ $I_R = 1.19628947044542$,

with an absolute error in $I_R$ equal to -1.4571e-06 (we gain two orders of magnitude with respect to $I_1$ and a factor 1/4 with respect to $I_2$). Using the Gauss formula we obtain (the errors are reported between brackets):

$I_1 = 1.19637085545393$ (-8.2842e-05),

$I_2 = 1.19629221796844$ (-4.2046e-06),

$I_R = 1.19628697546941$ (1.0379e-06).

The advantage of using the Richardson extrapolation method is evident. 

Solution 4.16 We must compute by the Simpson formula the values $j(r) = \frac{\sigma}{\varepsilon_0 r^2} \int_0^r f(\xi)\, d\xi$ with $r = k/10$, for $k = 1,\dots,10$ and $f(\xi) = e^{\xi} \xi^2$.

In order to estimate the integration error we need the fourth derivative $f^{(4)}(\xi) = e^{\xi}(\xi^2 + 8\xi + 12)$. The maximum of $f^{(4)}$ in the integration interval $(0, r)$ is attained at $\xi = r$ since $f^{(4)}$ is monotonically increasing. Then we obtain the following values:

>> r=[0.1:0.1:1];
>> maxf4=exp(r).*(r.^2+8*r+12)
maxf4 =

Columns 1 through 7 

14.1572 16.6599 19.5595 22.9144 26.7917 31.2676 36.4288 

Columns 8 through 10 
42.3743 49.2167 57.0839 

For a given $r$ the error is below $10^{-10}$ provided that $H^4 < 10^{-10} \cdot 2880/(r f^{(4)}(r))$. For $r = k/10$ with $k = 1,\dots,10$, the following instructions compute the minimum numbers of subintervals which ensure that the previous inequalities are satisfied. The components of the vector M contain these numbers:

>> x=[0.1:0.1:1]; f4=exp(x).*(x.^2+8*x+12);
>> H=(10^(-10)*2880./(x.*f4)).^(1/4); M=fix(x./H)

M = 

4 11 20 30 41 53 67 83 100 118 

Therefore, the values of j(r) are: 

>> sigma=0.36; epsilon0 = 8.859e-12;
f = inline('exp(x).*x.^2');
for k = 1:10
   r = k/10;
   j(k)=simpsonc(0,r,M(k),f);
   j(k) = j(k)*sigma/(r^2*epsilon0);  % j(r) = sigma/(epsilon0*r^2) * integral
end

Solution 4.17 We compute E(213) using the Simpson composite formula by increasing the number of intervals until the difference between two consecutive approximations (divided by the last computed value) is less than $10^{-11}$:

>> f=inline('2.39e-11./((x.^5).*(exp(1.432./(T*x))-1))','x','T');
>> a=3.e-04; b=14.e-04; T=213;
>> i=1; err = 1; Iold = 0; while err >= 1.e-11
I=simpsonc(a,b,i,f,T);
err = abs(I-Iold)/abs(I);
Iold=I;
i=i+1;
end

The procedure returns the value i = 59. Therefore, using 58 equispaced intervals we can compute the integral E(213) with ten exact significant digits. The same result could be obtained by the Gauss formula using 53 intervals. Note that as many as 1609 intervals would be needed if using the composite trapezoidal formula.

Solution 4.18 On the whole interval the given function is not regular enough to allow the application of the theoretical convergence result (4.22). One possibility is to decompose the integral into the sum of two integrals over the intervals (0, 0.5) and (0.5, 1), in which the function is regular (it is actually a polynomial of degree 3). In particular, if we use the Simpson rule on each interval we can even integrate $f$ exactly.
Solution 5.1 The number $r_k$ of algebraic operations (sums, subtractions and multiplications) required to compute a determinant of a matrix of order $k \ge 2$ with the Laplace rule (1.8), satisfies the following difference equation:

$$r_k - k\, r_{k-1} = 2k - 1,$$

with $r_1 = 0$. Multiplying both sides of this equation by $1/k!$, we obtain

$$\frac{r_k}{k!} - \frac{r_{k-1}}{(k-1)!} = \frac{2k-1}{k!}.$$

Summing both sides from 2 to n gives the solution:

$$r_n = n! \sum_{k=2}^{n} \frac{2k-1}{k!}.$$
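
As a quick numerical check, the difference equation can be iterated directly and compared with the telescoped sum r_n = n! * sum_{k=2}^{n} (2k-1)/k!. The value n = 10 below is an arbitrary illustrative choice.

```matlab
% Operation count of the Laplace rule: r_k = k*r_{k-1} + 2k - 1, r_1 = 0.
n = 10;
r = zeros(1,n);                  % r(1) = 0
for k = 2:n
    r(k) = k*r(k-1) + 2*k - 1;
end
k = 2:n;
closed = factorial(n) * sum((2*k - 1)./factorial(k));  % telescoped sum
[r(n), closed]                   % the two values agree up to rounding
```

For k = 2 the recursion gives r_2 = 3, that is, two multiplications and one subtraction for a 2x2 determinant.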
Solution 5.2 We use the following MATLAB commands to compute the determinants and the corresponding CPU-times:

>> t = []; for i = 3:500
A = magic(i); tt = cputime; d=det(A); t=[t, cputime-tt];
end

The coefficients of the cubic least-squares polynomial that approximates the data n=[3:500] and t are

>> n=[3:500]; format long; c=polyfit(n,t,3)
c =
   0.00000002102187   0.00000171915661  -0.00039318949610   0.01055682398911

The first coefficient (that multiplies $n^3$) is small, but not small enough with respect to the second one to be neglected. Indeed, if we compute the fourth degree least-squares polynomial we obtain the following coefficients:

>> c=polyfit(n,t,4)
c =
  Columns 1 through 4
  -0.00000000000051   0.00000002153039   0.00000155418071  -0.00037453657810
  Column 5
   0.01006704351509

From this result, we can conclude that the computation of a determinant of a matrix of dimension n requires approximately $n^3$ operations.



Solution 5.3 We have: $\det A_1 = 1$, $\det A_2 = \varepsilon$, $\det A_3 = \det A = 2\varepsilon + 12$. Consequently, if $\varepsilon = 0$ the second principal submatrix is singular and Proposition 5.1 cannot be applied. The matrix is singular if $\varepsilon = -6$. In this case the Gauss factorization yields

$$L = \begin{bmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 3 & 1.25 & 1 \end{bmatrix}, \qquad U = \begin{bmatrix} 1 & 7 & 3 \\ 0 & -12 & -4 \\ 0 & 0 & 0 \end{bmatrix}.$$

Note that U is singular (as we could have predicted since A is singular).



Solution 5.4 At step 1, $n - 1$ divisions are used to calculate the entries $l_{i1}$ for $i = 2,\dots,n$. Then $(n-1)^2$ multiplications and $(n-1)^2$ additions are used to create the new entries $a_{ij}^{(2)}$, for $i,j = 2,\dots,n$. At step 2, the number of divisions is $n - 2$, while the numbers of multiplications and additions are both $(n-2)^2$. At the final step $n - 1$ only 1 addition, 1 multiplication and 1 division are required. Thus, using the identities

$$\sum_{s=1}^{q} s = \frac{q(q+1)}{2}, \qquad \sum_{s=1}^{q} s^2 = \frac{q(q+1)(2q+1)}{6}, \qquad q \ge 1,$$

we can conclude that to complete the Gaussian factorization $2(n-1)n(n+1)/3 + n(n-1)$ ops are required. Neglecting the lower order terms, we can state that the Gaussian factorization process has a cost of $2n^3/3$ ops.



Solution 5.5 By definition, the inverse X of a matrix $A \in \mathbb{R}^{n \times n}$ satisfies XA = AX = I. Therefore, for $j = 1,\dots,n$ the column vector $y_j$ of X is the solution of the linear system $A y_j = e_j$, where $e_j$ is the j-th vector of the canonical basis of $\mathbb{R}^n$ with all components equal to zero except the j-th that is equal to 1. After computing the LU factorization of A, the computation of the inverse of A requires the solution of n linear systems with the same matrix and different right-hand sides.
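
The procedure can be sketched with MATLAB's built-in lu, which, unlike the factorization without pivoting discussed in this chapter, returns a row permutation P with P*A = L*U. The 3x3 matrix below is an arbitrary nonsingular example chosen for this sketch.

```matlab
% Inverse of A column by column: one factorization, n pairs of triangular solves.
A = [4 1 2; 1 5 3; 2 3 6];       % illustrative nonsingular matrix
n = size(A,1);
[L,U,P] = lu(A);                 % P*A = L*U
I = eye(n);
X = zeros(n);
for j = 1:n
    y = L \ (P*I(:,j));          % forward substitution for the j-th right-hand side
    X(:,j) = U \ y;              % backward substitution gives the j-th column of inv(A)
end
norm(A*X - eye(n))               % should be of the order of roundoff
```

The factorization is computed once; only the inexpensive triangular solves are repeated for each column.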

Solution 5.6 Using Program 7 we compute the following factors:

$$L = \begin{bmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 3 & -3.38 \cdot 10^{15} & 1 \end{bmatrix}, \qquad U = \begin{bmatrix} 1 & 1 & 3 \\ 0 & -8.88 \cdot 10^{-16} & 14 \\ 0 & 0 & 4.73 \cdot 10^{16} \end{bmatrix}.$$

If we compute their product we obtain the matrix

>> L*U
ans =
    1.0000    1.0000    3.0000
    2.0000    2.0000   20.0000
    3.0000    6.0000   -2.0000

which differs from A since the entry in position (3,3) is equal to -2 while in A it is equal to 4.

Solution 5.7 Usually, only the triangular (upper or lower) part of a sym- 
metric matrix is stored. Therefore, any operation that does not respect the 
symmetry of the matrix is not optimal in view of the memory storage. This 
is the case when row pivoting is carried out. A possibility is to exchange si- 
multaneously rows and columns having the same index, limiting therefore the 
choice of the pivot only to the diagonal elements. More generally, a pivoting 
strategy involving exchange of rows and columns is called complete pivoting 
(see, e.g., [QSS00, Chap. 3]).

Solution 5.8 The L and U factors are:

$$L = \begin{bmatrix} 1 & 0 & 0 \\ (\varepsilon - 2)/2 & 1 & 0 \\ 0 & -1/\varepsilon & 1 \end{bmatrix}, \qquad U = \begin{bmatrix} 2 & -2 & 0 \\ 0 & \varepsilon & 0 \\ 0 & 0 & 3 \end{bmatrix}.$$

When $\varepsilon \to 0$, $l_{32} \to \infty$. In spite of that, the solution of the system is accurate also when $\varepsilon$ tends to zero, as confirmed by the following instructions:

>> e=1; for k=1:10
b=[0; e; 2];
L=[1 0 0; (e-2)*0.5 1 0; 0 -1/e 1]; U=[2 -2 0; 0 e 0; 0 0 3];
y=L\b; x=U\y; err(k)=max(abs(x-ones(3,1))); e=e*0.1;
end
>> err
err =
     0     0     0     0     0     0     0     0     0     0

Solution 5.9 The computed solutions become less and less accurate when $i$ increases. Indeed, the error norms are equal to $2.63 \cdot 10^{-14}$ for $i = 1$, to $9.89 \cdot 10^{-10}$ for $i = 2$ and to $2.10 \cdot 10^{-6}$ for $i = 3$. This can be explained by observing that the condition number of $A_i$ increases as $i$ increases. Indeed, using the command cond we find that the condition number of $A_i$ is $\simeq 10^3$ for $i = 1$, $\simeq 10^7$ for $i = 2$ and $\simeq 10^{11}$ for $i = 3$.

Solution 5.10 If $(\lambda, v)$ are an eigenvalue-eigenvector pair of a matrix A, then $\lambda^2$ is an eigenvalue of $A^2$ with the same eigenvector. Indeed, from $Av = \lambda v$ follows $A^2 v = \lambda A v = \lambda^2 v$. Consequently, if A is symmetric and positive definite $K(A^2) = (K(A))^2$.
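
The relation between the two condition numbers is easy to observe numerically. As an illustration only, the sketch below uses the symmetric positive definite matrix that appears in Solution 5.14.

```matlab
% For a symmetric positive definite A, cond(A^2) equals cond(A)^2 (2-norm).
A = [3 2; 2 6];              % matrix of Solution 5.14, used here as an example
[cond(A)^2, cond(A*A)]       % the two numbers coincide up to rounding
```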

Solution 5.11 The iteration matrix of the Jacobi method is:

$$B_J = \begin{bmatrix} 0 & 0 & -\alpha^{-1} \\ 0 & 0 & 0 \\ -\alpha^{-1} & 0 & 0 \end{bmatrix}.$$

Its eigenvalues are $\{0, \alpha^{-1}, -\alpha^{-1}\}$. Thus the method converges if $|\alpha| > 1$.

The iteration matrix of the Gauss-Seidel method is

$$B_{GS} = \begin{bmatrix} 0 & 0 & -\alpha^{-1} \\ 0 & 0 & 0 \\ 0 & 0 & \alpha^{-2} \end{bmatrix}$$

with eigenvalues $\{0, 0, \alpha^{-2}\}$. Therefore, the method converges if $|\alpha| > 1$. In particular, since $\rho(B_{GS}) = [\rho(B_J)]^2$, the Gauss-Seidel method converges more rapidly than the Jacobi method.

Solution 5.12 A sufficient condition for the convergence of the Jacobi and the Gauss-Seidel methods is that A is strictly diagonally dominant. The second row of A satisfies the condition of diagonal dominance provided that $|\beta| < 5$. Note that if we require directly that the spectral radii of the iteration matrices are less than 1 (which is a necessary and sufficient condition for convergence), we find the (less restrictive) limitation $|\beta| < 25$ for both methods.

Solution 5.13 The relaxation method in vector form is

$$(I - \omega D^{-1} E)\, x^{(k+1)} = [(1 - \omega) I + \omega D^{-1} F]\, x^{(k)} + \omega D^{-1} b$$

where A = D - E - F, D being the diagonal of A, and E and F the lower (resp. upper) part of A. The corresponding iteration matrix is

$$B(\omega) = (I - \omega D^{-1} E)^{-1} [(1 - \omega) I + \omega D^{-1} F].$$

If we denote by $\lambda_i$ the eigenvalues of $B(\omega)$, we obtain

$$\left| \prod_{i=1}^{n} \lambda_i \right| = \left| \det[(1 - \omega) I + \omega D^{-1} F] \right| = |1 - \omega|^n.$$

Therefore, at least one eigenvalue must satisfy the inequality $|\lambda_i| \ge |1 - \omega|$. Thus, a necessary condition to ensure convergence is that $|1 - \omega| < 1$, that is, $0 < \omega < 2$.

Solution 5.14 The given matrix is symmetric. To verify whether it is also positive definite, that is, $z^T A z > 0$ for all $z \ne 0$ of $\mathbb{R}^2$, we use the following instructions:

>> syms z1 z2 real
>> z=[z1;z2]; A=[3 2; 2 6];
>> pos=z'*A*z; simple(pos)
ans =
3*z1^2+4*z1*z2+6*z2^2

The command syms z1 z2 real is necessary to declare that the symbolic variables z1 and z2 are real numbers, while the command simple(pos) tries several algebraic simplifications of pos and returns the shortest. It is easy to see that the computed quantity is positive since it can be rewritten as 2*(z1+z2)^2+z1^2+4*z2^2. Thus, the given matrix is symmetric and positive definite, and the Gauss-Seidel method is convergent.

Solution 5.15 We find: 

for the Jacobi method: 






T (i) _ I 

X 1 ~ 4 ’ 

,C1) - _ 




$$P^{-1} = \begin{bmatrix} 1/2 & 0 \\ 0 & 1/3 \end{bmatrix}$$

we have $z^{(0)} = P^{-1} r^{(0)} = (-3/4, -5/6)^T$. Therefore

$$\alpha_0 = \frac{(z^{(0)})^T r^{(0)}}{(z^{(0)})^T A z^{(0)}} = \frac{77}{107},$$

and

$$x^{(1)} = x^{(0)} + \alpha_0 z^{(0)} = (197/428, -32/321)^T.$$

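Solution 5.16 In the stationary case, $\rho(B_\alpha) = \max_\lambda |1 - \alpha\lambda|$, where $\lambda$ are the eigenvalues of $P^{-1}A$. The optimal value of $\alpha$ is obtained by solving the equation $|1 - \alpha\lambda_{min}| = |1 - \alpha\lambda_{max}|$, that is $1 - \alpha\lambda_{min} = -1 + \alpha\lambda_{max}$, which yields (5.39). Since

$$\rho(B_\alpha) = 1 - \alpha\lambda_{min} \qquad \forall \alpha \le \alpha_{opt},$$

for $\alpha = \alpha_{opt}$ we obtain (5.44).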

Solution 5.17 In this case the matrix associated to the Leontief model is not positive definite. Indeed, using the following instructions:

>> for i=1:20; for j=1:20; c(i,j)=i+j; end; end; A=eye(20)-c;

>> min(eig(A)) 
ans = 

-448.5830 
>> max(eig(A)) 
ans = 

30.5830 

we can see that the minimum eigenvalue is a negative number and the maximum eigenvalue is a positive number. Therefore, the convergence of the gradient method is not guaranteed. However, since A is non singular, the given system is equivalent to the system $A^T A x = A^T b$, where $A^T A$ is symmetric and positive definite. We solve the latter by the gradient method requiring that the norm of the residual be less than $10^{-10}$ and starting from the initial data $x^{(0)} = 0$:

>> b = [1:20]'; aa=A'*A; b=A'*b; x0 = zeros(20,1);
>> [x,iter]=richardson(aa,b,x0,100,1.e-10);

The method converges in 15 iterations. A drawback of this approach is that the condition number of the matrix $A^T A$ is, in general, larger than the condition number of A.
Solution 6.1 $A_1$: the power method converges in 34 iterations to the value 2.00000000004989. $A_2$: starting from the same initial vector, the power method now requires 457 iterations to converge to the value 1.99999999990611. The slower convergence rate can be explained by observing that the two largest eigenvalues are very close to one another. Finally, for the matrix $A_3$ the method doesn't converge since $A_3$ features two distinct eigenvalues ($i$ and $-i$) of maximum modulus.

Solution 6.2 The Leslie matrix associated with the values in the table is

$$A = \begin{bmatrix} 0 & 0.5 & 0.8 & 0.3 \\ 0.2 & 0 & 0 & 0 \\ 0 & 0.4 & 0 & 0 \\ 0 & 0 & 0.8 & 0 \end{bmatrix}.$$

Using the power method we find $\lambda_1 \simeq 0.5353$. The normalized distribution of this population for different age intervals is given by the components of the corresponding unit eigenvector, that is, $x_1 \simeq (0.8477, 0.3167, 0.2367, 0.3537)^T$.
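
A bare-bones power iteration on this matrix can be sketched as follows. This is not the book's eigpower program: the starting vector, the tolerance and the stopping test are assumptions made for this illustration.

```matlab
% Power method on the Leslie matrix of Solution 6.2 (illustrative sketch).
A = [0 0.5 0.8 0.3; 0.2 0 0 0; 0 0.4 0 0; 0 0 0.8 0];
x = ones(4,1);  x = x/norm(x);        % unit starting vector (assumption)
lambda = 0;  tol = 1e-10;  nmax = 1000;
for iter = 1:nmax
    y = A*x;
    lambdanew = x'*y;                 % Rayleigh-type quotient, since norm(x) = 1
    x = y/norm(y);                    % normalize the new iterate
    done = abs(lambdanew - lambda) < tol*abs(lambdanew);
    lambda = lambdanew;
    if done, break, end
end
lambda, x
```

The computed lambda and x can be compared with the values 0.5353 and (0.8477, 0.3167, 0.2367, 0.3537)^T reported in the solution.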



Solution 6.3 We rewrite the initial guess as

$$y^{(0)} = \beta^{(0)} \left( \alpha_1 x_1 + \alpha_2 x_2 + \sum_{i=3}^{n} \alpha_i x_i \right),$$

with $\beta^{(0)} = 1/\|x^{(0)}\|$. By calculations similar to those carried out in Section 6.1, at the generic step $k$ we find:

$$y^{(k)} = \gamma^k \beta^{(k)} \left( \alpha_1 x_1 e^{ik\vartheta} + \alpha_2 x_2 e^{-ik\vartheta} + \sum_{i=3}^{n} \alpha_i \frac{\lambda_i^k}{\gamma^k} x_i \right).$$

The first two terms don't vanish and, due to the opposite sign of the exponents, the sequence of the $y^{(k)}$ oscillates and cannot converge.



Solution 6.4 From the eigenvalue equation $Ax = \lambda x$, we deduce $A^{-1}Ax = \lambda A^{-1}x$, and therefore $A^{-1}x = (1/\lambda)x$.



Solution 6.5 The power method applied to the matrix A generates an oscillating sequence of approximations of the maximum modulus eigenvalue (see Figure 9.8). This behavior is due to the fact that this eigenvalue is not unique.



Fig. 9.8. The approximations of the maximum modulus eigenvalue of the matrix of Solution 6.5 computed by the power method

Solution 6.6 To compute the eigenvalue of maximum modulus of A we use Program 9:

>> A=wilkinson(7);
>> x0=ones(7,1); tol=1.e-15; nmax=100;
>> [lambda,x,iter]=eigpower(A,tol,nmax,x0);

After 35 iterations we obtain lambda=3.76155718183189. To find the largest negative eigenvalue of A, we can use the power method with shift and, in particular, we can choose a shift equal to the largest positive eigenvalue that we have just computed. We find:

>> [lambda2,x,iter]=eigpower(A-lambda*eye(7),tol,nmax,x0);
>> lambda2+lambda
ans =
  -1.12488541976457

after iter = 33 iterations. These results are satisfactory approximations of the largest (positive and negative) eigenvalues of A.

Solution 6.7 Since all the coefficients of A are real, eigenvalues occur in con- 
jugate pairs. Note that in this situation conjugate eigenvalues must belong to 
the same Gershgorin circle. The matrix A presents 2 column circles isolated 
from the others (see Figure 9.9 on the left). Each of them must contain only 
one eigenvalue that must therefore be real. Then A admits at least 2 real 
eigenvalues. 

Let us consider now the matrix B that admits only one column isolated 
circle (see Figure 9.9 on the right). Then, thanks to the previous consideration 
the corresponding eigenvalue must be real. The remaining eigenvalues can be 
either all real, or one real and 2 complex. 




Fig. 9.9. On the left, column circles of the matrix A of Solution 6.7. On the 
right, column circles of the matrix B of Solution 6.7 



Solution 6.8 The row circles of A feature an isolated circle of center 5 and radius 2, to which the maximum modulus eigenvalue must belong. Therefore, we can set the value of the shift equal to 5. The comparison between the number of iterations and the computational cost of the power method with and without shift can be found using the following commands:

>> A=[5 0 1 -1; 0 2 0 -1/2; 0 1 -1 1; -1 -1 0 0];
>> flops(0); [lambda,x,iter]=eigpower(A,tol,nmax,x0); flops, iter
ans =
        2204
iter =
    34
>> flops(0); [lambda,x,iter]=invshift(A,tol,nmax,5,x0); flops, iter
ans =
        1082
iter =
    12

The power method with shift requires in this case a lower number of iterations (12 against 34, that is, a ratio of about 1 to 3) and half the cost of the usual power method (also accounting for the extra time needed to compute the Gauss factorization of A off-line).



Solution 6.9 Using the qr command we have immediately:

>> A=[2 -1/2 0 -1/2; 0 4 0 2; -1/2 0 6 1/2; 0 0 1 9];
>> [Q,R]=qr(A)
Q =
   -0.9701    0.0073   -0.2389   -0.0411
         0   -0.9995   -0.0299   -0.0051
    0.2425    0.0294   -0.9557   -0.1643
         0         0   -0.1694    0.9855
R =
   -2.0616    0.4851    1.4552    0.6063
         0   -4.0018    0.1764   -1.9881
         0         0   -5.9035   -1.9426
         0         0         0    8.7981

To verify that RQ is similar to A, we observe that

$$Q^T A = Q^T Q R = R$$

thanks to the orthogonality of Q. Thus $C = Q^T A Q = RQ$, since $Q^T = Q^{-1}$, and we conclude that C is similar to A.

Solution 6.10 A simple implementation of the proposed algorithm is:

function [A]=qrsimple(A,it)
for i = 1:it
   [Q,R]=qr(A);
   A = R*Q;
end

where it is the number of iterations after which the algorithm is stopped. For the given matrix, after 10 iterations we obtain the matrix $A^{(10)}$:

A10 =
    9.2090    1.1678    1.5237    0.0705
    0.1220    4.9228   -0.5739    0.0461
   -0.1732   -1.3095    4.8686   -0.6902
         0         0    0.0017    1.9996

while, after 50 iterations we obtain the matrix $A^{(50)}$:

A50 =
    9.1728   -0.8113    1.8631    0.1059
    0.0000    5.8003    0.6208    0.5914
   -0.0000   -0.0000    4.0268   -0.3526
         0         0    0.0000    2.0000

Note that the matrices $A^{(k)}$ are similar since $A^{(k+1)} = (Q^{(k)})^T A^{(k)} Q^{(k)}$ for $k \ge 0$ (see Solution 6.9). Therefore, we can guess that the sequence of matrices $A^{(k)}$ when $k \to \infty$ tends to an upper triangular matrix whose diagonal elements are the eigenvalues of A.
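
The guess can be checked by comparing the diagonal of the iterated matrix with the output of eig. The sketch below inlines the QR loop so that it runs on its own; the matrix is the one used in Solution 6.9 and the iteration count 50 matches the experiment above.

```matlab
% Diagonal of the QR iterates versus eig (illustrative check).
A = [2 -1/2 0 -1/2; 0 4 0 2; -1/2 0 6 1/2; 0 0 1 9];
T = A;
for k = 1:50
    [Q,R] = qr(T);
    T = R*Q;                      % orthogonally similar to the previous iterate
end
[sort(diag(T)), sort(eig(A))]     % the two columns should nearly coincide
```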

Solution 6.11 We can use the command eig in the following way: [X,D]=eig(A), where X is the matrix whose columns are the unit eigenvectors of A and D is a diagonal matrix whose elements are the eigenvalues of A. For the matrices A and B of Exercise 6.7 we should execute the following instructions:

>> A=[2 -1/2 0 -1/2; 0 4 0 2; -1/2 0 6 1/2; 0 0 1 9];

>> eig(A) 
ans = 

2.0000 

5.8003 

9.1728 

4.0268 

» B=[-5 0 1/2 1/2; 1/2 2 1/2 0; 0 1 0 1/2; 0 1/4 1/2 3]; 

>> eig(B) 
ans = 

-4.9921 

2.1666 

-0.3038 

3.1292 
Solution 7.1 Let us approximate the exact solution $y(t) = \frac{1}{2}[e^t - \sin(t) - \cos(t)]$ of the Cauchy problem (7.43) by the forward Euler method using different values of h: 1/2, 1/4, 1/8, ..., 1/1024. The associated error is computed by the following instructions:

>> y0=0; f=inline('sin(t)+y','t','y'); y='0.5*(exp(t)-sin(t)-cos(t))';
>> tspan=[0 1]; N=2; for k=1:10
[tt,u]=feuler(f,tspan,y0,N);t=tt(end);e(k)=abs(u(end)-eval(y));N=2*N;end

>> e 

e = 

Columns 1 through 7 

0.4285 0.2514 0.1379 0.0725 0.0372 0.0189 0.0095 

Columns 8 through 10 
0.0048 0.0024 0.0012 

Now we apply formula (1.11) to estimate the order of convergence: 

>> p=log(abs(e(l:end-l)./e(2:end)))/log(2) 

P = 

Columns 1 through 7 

0.7696 0.8662 0.9273 0.9620 0.9806 0.9902 0.9951 

Columns 8 through 9 
0.9975 0.9988 

As expected the order of convergence is one. With the same instructions (substituting the program feuler with the program beuler) we obtain an estimate of the convergence order of the backward Euler method:

>> p=log(abs(e(1:end-1)./e(2:end)))/log(2)

P = 

Columns 1 through 7 

1.5199 1.1970 1.0881 1.0418 1.0204 1.0101 1.0050 

Columns 8 through 9 
1.0025 1.0012 
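
The forward Euler experiment can also be reproduced without the book's feuler program. The following sketch inlines the Euler step for the same right-hand side sin(t) + y and exact solution on [0,1]; the variable names are choices of this illustration.

```matlab
% Forward Euler: empirical order of convergence at t = 1.
f  = @(t,y) sin(t) + y;                      % right-hand side of the Cauchy problem
ye = @(t) 0.5*(exp(t) - sin(t) - cos(t));    % exact solution
N = 2;  e = zeros(1,10);
for k = 1:10
    h = 1/N;  u = 0;  t = 0;                 % y(0) = 0
    for n = 1:N
        u = u + h*f(t,u);                    % forward Euler step
        t = t + h;
    end
    e(k) = abs(u - ye(1));                   % error at the final time
    N = 2*N;
end
p = log(e(1:end-1)./e(2:end))/log(2)
```

The entries of p should tend to 1, as in the table above.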

Solution 7.2 The numerical solution of the given Cauchy problem by the forward Euler method can be obtained as follows:

>> tspan=[0 1]; N=100; f=inline('-t*exp(-y)','t','y'); y0=0;
>> [t,u]=feuler(f,tspan,y0,N);

To compute the number of exact significant digits we can estimate the constants L and M which appear in (7.11). Note that, since $f(t,y(t)) < 0$ in the given interval, $y(t) = \log(1 - t^2/2)$ is a monotonically decreasing function, vanishing at $t = 0$. Since $f$ is continuous together with its first derivative, we can approximate L as $L = \max_{0 \le t \le 1} |L(t)|$ with $L(t) = \partial f/\partial y = t e^{-y}$. Note that $L(0) = 0$ and $L(t) > 0$ for all $t \in (0,1]$. Thus, $L = e$.

Similarly, in order to compute $M = \max_{0 \le t \le 1} |y''(t)|$ with $y'' = -e^{-y} - t^2 e^{-2y}$, we can observe that this function has its maximum at $t = 1$, and then $M = e + e^2$. From (7.11) we deduce

Therefore, there is no guarantee that more than one significant digit be exact. 
Indeed, we find u(end)=-0.6785, while the exact solution at t = 1 is y( 1) = 
-0.6931. 
Solution 7.3 The iteration function is $\phi(u) = u_n - h t_{n+1} e^{-u}$ and the fixed-point iteration converges if $|\phi'(u)| < 1$. This property is ensured if $h(t_0 + (n+1)h) < e^u$. If we substitute $u$ with the exact solution, we can provide an a priori estimate of the value of h. The most restrictive situation occurs when $u = -1$ (see Solution 7.2). In this case the solution of the inequality $(n+1)h^2 < e^{-1}$ is $h < \sqrt{e^{-1}/(n+1)}$.

Solution 7.4 We repeat the same set of instructions of Solution 7.1, however now we use the program cranknic (Program 14) instead of feuler. According to the theory, we obtain the following result that shows second-order convergence:

>> p=log(abs(e(1:end-1)./e(2:end)))/log(2)

P = 

Columns 1 through 7 

2.0379 2.0092 2.0023 2.0006 2.0001 2.0000 2.0000 

Columns 8 through 9 

2.0000 2.0000 

Solution 7.5 Consider the integral formulation of the Cauchy problem (7.4) in the interval $[t_n, t_{n+1}]$:

$$y(t_{n+1}) - y(t_n) = \int_{t_n}^{t_{n+1}} f(\tau, y(\tau))\, d\tau \simeq \frac{h}{2} \left[ f(t_n, y(t_n)) + f(t_{n+1}, y(t_{n+1})) \right],$$

where we have approximated the integral by the trapezoidal formula (4.17). By setting $u_0 = y(t_0)$ and replacing $y(t_n)$ by the approximate value $u_n$ and the symbol $\simeq$ by $=$, we obtain

$$u_{n+1} = u_n + \frac{h}{2} \left[ f(t_n, u_n) + f(t_{n+1}, u_{n+1}) \right], \qquad \forall n \ge 0,$$

which is the Crank-Nicolson method.

Solution 7.6 Set $h\lambda = x + iy$, so that $|1 + h\lambda|^2 = (1+x)^2 + y^2$. Then, the assumption $|1 + h\lambda| < 1$ is equivalent to $(1+x)^2 + y^2 < 1$. It follows that the region of absolute stability for the forward Euler method is the interior of the unit circle centered at $(-1, 0)$ in the complex plane (see Figure 9.11).

Solution 7.7 Thanks to the result found in Solution 7.6, we must impose the 
limitation |1 — h + ih\ < 1, which yields 0 < h < 1. 

Solution 7.8 Let us rewrite the Heun method in the following (Runge-Kutta like) form:

$$u_{n+1} = u_n + \frac{1}{2}(k_1 + k_2), \quad k_1 = h f(t_n, u_n), \quad k_2 = h f(t_{n+1}, u_n + k_1). \qquad (9.4)$$

We have $h\tau_{n+1}(h) = y(t_{n+1}) - y(t_n) - (k_1 + k_2)/2$, with $k_1 = h f(t_n, y(t_n))$ and $k_2 = h f(t_{n+1}, y(t_n) + k_1)$. Therefore, the method is consistent since

$$\lim_{h \to 0} \tau_{n+1} = y'(t_n) - \frac{1}{2} \left[ f(t_n, y(t_n)) + f(t_n, y(t_n)) \right] = 0.$$

The Heun method is implemented in Program 19. Using this program, we can verify the order of convergence as in Solution 7.1. By the following instructions, we find that the Heun method is second-order with respect to h

>> p=log(abs(e(1:end-1)./e(2:end)))/log(2)

P = 

Columns 1 through 7 

1.7642 1.8796 1.9398 1.9700 1.9851 1.9925 1.9963 

Columns 8 through 9 
1.9981 1.9991 



Program 19 - rk2: the Heun method

function [t,u]=rk2(odefun,tspan,y0,Nh,varargin)
h=(tspan(2)-tspan(1))/Nh; tt=[tspan(1):h:tspan(2)]; u(1)=y0;
for s=tt(1:end-1)
   t = s; y = u(end); k1=h*feval(odefun,t,y,varargin{:});
   t = t + h; y = y + k1; k2=h*feval(odefun,t,y,varargin{:});
   u = [u, u(end) + 0.5*(k1+k2)];
end
t=tt;



Solution 7.9 Applying the method (9.4) to the model problem (7.19) we obtain $k_1 = h\lambda u_n$ and $k_2 = h\lambda u_n (1 + h\lambda)$. Therefore $u_{n+1} = u_n [1 + h\lambda + (h\lambda)^2/2] = u_n p_2(h\lambda)$. To ensure absolute stability we must require that $|p_2(h\lambda)| < 1$, which is equivalent to $0 < p_2(h\lambda) < 1$, since $p_2(h\lambda)$ is positive. Solving the latter inequality, we obtain $-2 < h\lambda < 0$, that is, $h < 2/|\lambda|$.

Solution 7.10 Note that

$$u_n = u_{n-1}(1 + h\lambda_{n-1}) + h r_{n-1}.$$

Then proceed recursively on n.



Solution 7.11 The inequality (7.29) follows from (7.28) by setting 



V?(A) = 



, 1 




1 


1 + T 


+ 




A 


A 



The conclusion follows easily. 
Solution 7.12 From (7.26) we have

n — 1 

| Zn Un\ PmaxO* 4“ hpmax ^ ^ (5(/l) 

k = 0 

The result follows using (7.27). 

Solution 7.13 We have

$$h\tau_{n+1}(h) = y(t_{n+1}) - y(t_n) - \frac{1}{6}(k_1 + 4k_2 + k_3),$$

$$k_1 = h f(t_n, y(t_n)), \qquad k_2 = h f\left(t_n + \frac{h}{2},\, y(t_n) + \frac{k_1}{2}\right),$$

$$k_3 = h f(t_{n+1},\, y(t_n) + 2k_2 - k_1).$$

This method is consistent since

$$\lim_{h \to 0} \tau_{n+1} = y'(t_n) - \frac{1}{6} \left[ f(t_n, y(t_n)) + 4 f(t_n, y(t_n)) + f(t_n, y(t_n)) \right] = 0.$$

This method is an explicit Runge-Kutta method of order 3 and is imple- 
mented in Program 20. As in Solution 7.8, we can derive an estimate of its 
order of convergence by the following instructions: 

>> p=log(abs(e(l:end-l)./e(2:end)))/log(2) 

P = 

Columns 1 through 7 

2.7306 2.8657 2.9330 2.9666 2.9833 2.9916 2.9958 

Columns 8 through 9 
2.9979 2.9990 

Solution 7.14 From Solution 7.9 we obtain the relation

$$u_{n+1} = u_n \left[ 1 + h\lambda + \frac{1}{2}(h\lambda)^2 + \frac{1}{6}(h\lambda)^3 \right] = u_n p_3(h\lambda).$$

By inspection of the graph of $p_3$, obtained with the instruction

>> c=[1/6 1/2 1 1]; z=[-3:0.01:1]; p=polyval(c,z); plot(z,abs(p))

we deduce that $|p_3(h\lambda)| < 1$ for $-2.5 < h\lambda < 0$.



Program 20 - rk3: Explicit Runge-Kutta method of order 3

function [t,u]=rk3(odefun,tspan,y0,Nh,varargin)
h=(tspan(2)-tspan(1))/Nh; tt=[tspan(1):h:tspan(2)]; u(1)=y0;
for s=tt(1:end-1)
   t = s; y = u(end); k1=h*feval(odefun,t,y,varargin{:});
   t = t + h*0.5; y = y + 0.5*k1; k2=h*feval(odefun,t,y,varargin{:});
   t = s + h; y = u(end) + 2*k2-k1; k3=h*feval(odefun,t,y,varargin{:});
   u = [u, u(end) + (k1+4*k2+k3)/6];
end
t=tt;



Solution 7.15 The method (7.46) applied to the model problem (7.19) gives the equation $u_{n+1} = u_n (1 + h\lambda + (h\lambda)^2)$. From the graph of $1 + z + z^2$ with $z = h\lambda$, we deduce that the method is absolutely stable if $-1 < h\lambda < 0$.

Solution 7.16 To solve Problem 7.1 with the given values, we repeat the 
following instructions with N=10 and N=20: 

>> f=inline('-1.68*10^(-9)*y^4+2.6880','t','y');
>> [t,uc]=cranknic(f,[0,200],180,N);
>> [t,u]=predcor(f,[0 200],180,N,'feonestep','cnonestep');

The graphs of the computed solutions are shown in Figure 9.10. The solutions 
obtained by the Crank-Nicolson method are more accurate than those obtained 
by the Heun method. 




Fig. 9.10. Computed solutions with h = 20 (left) and h = 10 (right) for the Cauchy problem of Solution 7.16: in the continuous line, the solutions computed by the Crank-Nicolson method, in the dashed lines those computed by the Heun method



Solution 7.17 The Heun method, applied to the model problem (7.19), gives

$$u_{n+1} = u_n\left(1 + h\lambda + \frac{1}{2}h^2\lambda^2\right).$$

In the complex plane the boundary of its region of absolute stability satisfies
$|1 + h\lambda + h^2\lambda^2/2|^2 = 1$, having set $h\lambda = x + iy$. This equation is satisfied by the
numbers $(x, y)$ such that $f(x, y) = x^4 + y^4 + 2x^2y^2 + 4x^3 + 4xy^2 + 8x^2 + 8x = 0$.
We can represent this curve as the level curve $f(x, y) = z$ (corresponding to
the level $z = 0$). This can be done by means of the following instructions:

>> f='x.^4+y.^4+2*(x.^2).*(y.^2)+4*x.*y.^2+4*x.^3+8*x.^2+8*x';

>> [x,y]=meshgrid([-2.1:0.1:0.1],[-2:0.1:2]);

>> contour(x,y,eval(f),[0 0])

The command meshgrid draws in the rectangle $[-2.1, 0.1] \times [-2, 2]$ a grid
with 23 equispaced nodes in the x-direction, and 41 equispaced nodes in the
y-direction. With the command contour we plot the level curve of $f(x, y)$ (evaluated
with the command eval(f)) corresponding to the value $z = 0$ (made
precise in the input vector [0 0] of contour). In Figure 9.11 the continuous
line delimits the region of absolute stability of the Heun method. This region
is larger than the corresponding region of the forward Euler method (which
corresponds to the interior of the dashed circle). Both curves are tangent to
the imaginary axis at the origin (0,0).
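
As a quick consistency check, one can verify numerically that $f(x,y)$ coincides with $4(|1 + z + z^2/2|^2 - 1)$ for $z = x + iy$, so that the level curve $f = 0$ is indeed the boundary of the stability region. The following lines are a sketch; the handle R and the variable d are names introduced here for illustration.

```matlab
R = @(z) 1 + z + z.^2/2;                 % stability function of the Heun method
f = @(x,y) x.^4 + y.^4 + 2*(x.^2).*(y.^2) + 4*x.*y.^2 + 4*x.^3 + 8*x.^2 + 8*x;
[x,y] = meshgrid(-2.1:0.1:0.1, -2:0.1:2);
% largest discrepancy between f and 4*(|R(z)|^2-1) over the grid nodes
d = max(max(abs(f(x,y) - 4*(abs(R(x + 1i*y)).^2 - 1))));
fprintf('max discrepancy = %.2e\n', d);
% the curve crosses the real axis at x = -2 and x = 0
disp([f(-2,0), f(0,0)]);
```

The discrepancy is at the level of roundoff, and the two real points of the curve confirm that on the real axis the Heun method is absolutely stable for $-2 < h\lambda < 0$.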




Fig. 9.11. Boundaries of the regions of absolute stability for the Heun method 
(continuous line) and the forward Euler method (dashed line). The correspond- 
ing regions lie at the interior of the boundaries 



Solution 7.18 We use the following instructions:

>> tspan=[0 1]; y0=0; f=inline('cos(2*y)','t','y');

>> y='0.5*asin((exp(4*t)-1)./(exp(4*t)+1))';

>> N=2; for k=1:10
[tt,u]=predcor(f,tspan,y0,N,'feonestep','cnonestep');
t=tt(end); e(k)=abs(u(end)-eval(y)); N=2*N; end

>> p=log(abs(e(1:end-1)./e(2:end)))/log(2)

p =

  Columns 1 through 7

    2.4733 2.2507 2.1223 2.0601 2.0298 2.0148 2.0074

  Columns 8 through 9

    2.0037 2.0018
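
The same experiment can be sketched without predcor, writing the Heun step (forward Euler predictor followed by a trapezoidal corrector) explicitly. This is a minimal illustration under that assumption; the handle yex and the loop variables are names chosen here.

```matlab
f   = @(t,y) cos(2*y);                                 % right-hand side
yex = @(t) 0.5*asin((exp(4*t)-1)./(exp(4*t)+1));       % exact solution
tspan = [0 1]; y0 = 0;
N = 2; e = zeros(1,10);
for k = 1:10
    h = (tspan(2)-tspan(1))/N; t = tspan(1); u = y0;
    for n = 1:N
        k1 = f(t, u);                 % slope used by the forward Euler predictor
        k2 = f(t+h, u + h*k1);        % slope at the predicted value
        u  = u + 0.5*h*(k1 + k2);     % trapezoidal corrector
        t  = t + h;
    end
    e(k) = abs(u - yex(tspan(2)));    % error at the final time
    N = 2*N;
end
p = log(e(1:end-1)./e(2:end))/log(2)  % estimated orders of convergence
```
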

As expected, we find that the order of convergence of the method is 2. However,
the computational cost is comparable with that of the forward Euler method,
which is first-order accurate only.

Solution 7.19 The second-order differential equation of this exercise is equivalent
to the following first-order system:

$$x' = z, \quad z' = -5z - 6x,$$

with $x(0) = 1$, $z(0) = 0$. We use the Heun method as follows:

>> tspan=[0 5]; y0=[1 0];

>> [tt,u]=predcor('fspring',tspan,y0,N,'feonestep','cnonestep');

where N is the number of nodes and fspring.m is the following function:

function y=fspring(t,y)
b=5; k=6;
yy=y; y(1)=yy(2); y(2)=-b*yy(2)-k*yy(1);

In Figure 9.12 we show the graphs of the two components of the solution,
computed with N=20 and N=40, and compare them with the graph of the exact solution
$x(t) = 3e^{-2t} - 2e^{-3t}$ and that of its first derivative.




Fig. 9.12. Approximations of x(t) (continuous line) and x'(t) (dashed line)
computed with N=20 (thin line) and N=40 (thick line). Small circles and squares
refer to the exact functions x(t) and x'(t), respectively



Solution 7.20 The second-order system of differential equations is reduced
to the following first-order system:

$$\begin{cases} x' = z, \\ y' = v, \\ z' = 2\omega\sin(\Psi)\,v - k^2 x, \\ v' = -2\omega\sin(\Psi)\,z - k^2 y. \end{cases} \qquad (9.5)$$

If we suppose that the pendulum at the initial time $t_0 = 0$ is at rest in the
position (1,0), the system (9.5) must be given the following initial conditions:

$$x(0) = 1, \quad y(0) = 0, \quad z(0) = 0, \quad v(0) = 0.$$

Setting $\Psi = \pi/4$, which is the average latitude of Northern Italy, we use
the forward Euler method as follows:

>> [t,y]=feuler('ffocault',[0 300],[1 0 0 0],Nh);

where Nh is the number of steps and ffocault.m is the following function:

function y=ffocault(t,y)
l=20; k2=9.8/l; psi=pi/4; omega=7.29*1.e-05;
yy=y; y(1)=yy(3); y(2)=yy(4);
y(3)=2*omega*sin(psi)*yy(4)-k2*yy(1);
y(4)=-2*omega*sin(psi)*yy(3)-k2*yy(2);

By some numerical experiments we conclude that the forward Euler method
cannot produce acceptable solutions for this problem even for very small h. For
instance, on the left of Figure 9.13 we show the graph, in the phase plane (x, y),
of the motion of the pendulum computed with N=30000, that is, h = 1/100. As
expected, the rotation plane changes with time, but also the amplitude of the
oscillations increases. Similar results can be obtained for smaller h and using
the Heun method. In fact, the model problem corresponding to the problem at
hand has a coefficient $\lambda$ that is purely imaginary. The corresponding solution
(a sinusoid) is bounded for t that tends to infinity, however it doesn't tend to
zero.

Unfortunately, both the forward Euler and Heun methods feature a region 
of absolute stability that doesn’t include any point of the imaginary axis (with 
the exception of the origin). Thus, to ensure the absolute stability one should 
choose the prohibited value h = 0. 

To get an acceptable solution we should use a method whose region of
absolute stability includes a portion of the imaginary axis. This is the case,
for instance, for the adaptive Runge-Kutta method of order 3, implemented
in the MATLAB function ode23. We can invoke it by the following command:

>> [t,u]=ode23('ffocault',[0 300],[1 0 0 0]);

In Figure 9.13 (right) we show the solution obtained using only 1022 integra- 
tion steps. Note that the numerical solution is in good agreement with the 
analytical one. 
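
The growth of the amplitude produced by the forward Euler method can be observed with the following self-contained sketch. It assumes the same data as ffocault.m, written here in matrix form; the names A, c and rmax are introduced only for this illustration. The largest distance of the computed trajectory from the origin should stay close to 1, the initial one, if the method were reliable.

```matlab
l = 20; k2 = 9.8/l; psi = pi/4; omega = 7.29e-5;
c = 2*omega*sin(psi);
A = [  0    0    1   0;
       0    0    0   1;
     -k2    0    0   c;
       0  -k2   -c   0];
f  = @(t,u) A*u;                  % same right-hand side as ffocault
u0 = [1; 0; 0; 0]; T = 300;

% forward Euler method with h = 1/100
N = 30000; h = T/N; u = u0; rmax = 0;
for n = 1:N
    u = u + h*f((n-1)*h, u);
    rmax = max(rmax, norm(u(1:2)));
end
fprintf('forward Euler: largest distance from the origin = %.3f\n', rmax);

% adaptive Runge-Kutta method of order 3
[t23, u23] = ode23(f, [0 T], u0);
r23 = max(sqrt(u23(:,1).^2 + u23(:,2).^2));
fprintf('ode23: %d steps, largest distance = %.3f\n', numel(t23)-1, r23);
```

With forward Euler each oscillating mode is multiplied at every step by a factor of modulus larger than 1, which explains the spiral that opens up in the phase plane. The number of steps taken by ode23 depends on the MATLAB version and on the tolerances in use.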




Fig. 9.13. Trajectories on the phase plane for the Foucault pendulum of Solution
7.20 computed by the forward Euler method (left) and the third-order
adaptive Runge-Kutta method (right)

Solution 8.1 We can verify directly that $\mathbf{x}^T A \mathbf{x} > 0$ for all $\mathbf{x} \neq \mathbf{0}$. Indeed,

$$[x_1 \; x_2 \; \ldots \; x_{N-1} \; x_N]
\begin{bmatrix}
2 & -1 & 0 & \ldots & 0 \\
-1 & 2 & \ddots & & \vdots \\
0 & \ddots & \ddots & -1 & 0 \\
\vdots & & -1 & 2 & -1 \\
0 & \ldots & 0 & -1 & 2
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_{N-1} \\ x_N \end{bmatrix}
= 2x_1^2 - 2x_1x_2 + 2x_2^2 - 2x_2x_3 + \ldots - 2x_{N-1}x_N + 2x_N^2.$$

The last expression is equivalent to $(x_1 - x_2)^2 + \ldots + (x_{N-1} - x_N)^2 + x_1^2 + x_N^2$,
which is positive provided that at least one $x_i$ is non-null.
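
The identity can be checked numerically on a small example. In the sketch below the size N = 8 and the test vector are arbitrary choices made here; q1 and q2 are illustrative names.

```matlab
N = 8;
e = ones(N,1);
A = full(spdiags([-e 2*e -e], -1:1, N, N));   % tridiagonal matrix of Solution 8.1
x = sin((1:N)');                              % an arbitrary non-null vector
q1 = x'*A*x;                                  % quadratic form
q2 = sum((x(1:end-1) - x(2:end)).^2) + x(1)^2 + x(N)^2;   % sum of squares
fprintf('x''*A*x = %.12f, sum of squares = %.12f\n', q1, q2);
fprintf('smallest eigenvalue of A = %.4f\n', min(eig(A)));
```
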



Solution 8.2 We verify that $A\mathbf{q}_j = \lambda_j\mathbf{q}_j$. Computing the matrix-vector product
$\mathbf{w} = A\mathbf{q}_j$ and requiring that $\mathbf{w}$ is equal to the vector $\lambda_j\mathbf{q}_j$, we find:

$$\begin{cases}
2\sin(j\theta) - \sin(2j\theta) = 2(1 - \cos(j\theta))\sin(j\theta), \\
-\sin(jk\theta) + 2\sin(j(k+1)\theta) - \sin(j(k+2)\theta) = 2(1 - \cos(j\theta))\sin(j(k+1)\theta), \quad k = 1, \ldots, N-2, \\
2\sin(Nj\theta) - \sin((N-1)j\theta) = 2(1 - \cos(j\theta))\sin(Nj\theta).
\end{cases}$$

The first equation is an identity since $\sin(2j\theta) = 2\sin(j\theta)\cos(j\theta)$. The other
equations can be simplified since

$$\sin(jk\theta) = \sin((k+1)j\theta)\cos(j\theta) - \cos((k+1)j\theta)\sin(j\theta),$$
$$\sin(j(k+2)\theta) = \sin((k+1)j\theta)\cos(j\theta) + \cos((k+1)j\theta)\sin(j\theta).$$

For the last equation we also use $\sin((N+1)j\theta) = \sin(j\pi) = 0$, since $\theta = \pi/(N+1)$.

Since A is symmetric and positive definite, its condition number is $K(A) = \lambda_{max}/\lambda_{min}$,
that is, $K(A) = \lambda_N/\lambda_1 = (1 - \cos(N\pi/(N+1)))/(1 - \cos(\pi/(N+1)))$.
Using the Taylor expansion of order 2 of the cosine function, we obtain
$K(A) \sim N^2$, that is, $K(A) \sim h^{-2}$.
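
A numerical check of the eigenvalues, of one eigenvector and of the growth of the condition number may be sketched as follows. The values N = 50 and j = 3 are arbitrary choices made here, and the constant $4/(\pi h)^2$ is what the second-order Taylor expansion gives with $h = 1/(N+1)$.

```matlab
N = 50; h = 1/(N+1); theta = pi/(N+1);
e = ones(N,1);
A = full(spdiags([-e 2*e -e], -1:1, N, N));
lambda = 2*(1 - cos((1:N)'*theta));     % eigenvalues of Solution 8.2, increasing in j
err = max(abs(sort(eig(A)) - lambda));  % comparison with the computed eigenvalues
j = 3; q = sin(j*(1:N)'*theta);         % eigenvector associated with lambda_j
res = norm(A*q - lambda(j)*q);
K = lambda(N)/lambda(1);                % condition number
fprintf('eigenvalue error %.1e, residual %.1e\n', err, res);
fprintf('K(A) = %.2f, 4/(pi*h)^2 = %.2f\n', K, 4/(pi*h)^2);
```
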

Solution 8.3 We note that

$$u(x + h) = u(x) + hu'(x) + \frac{h^2}{2}u''(x) + \frac{h^3}{6}u'''(x) + \frac{h^4}{24}u^{(4)}(\xi_+),$$

$$u(x - h) = u(x) - hu'(x) + \frac{h^2}{2}u''(x) - \frac{h^3}{6}u'''(x) + \frac{h^4}{24}u^{(4)}(\xi_-),$$

where $\xi_+ \in (x, x + h)$ and $\xi_- \in (x - h, x)$. Summing the two expressions we obtain

$$u(x + h) + u(x - h) = 2u(x) + h^2u''(x) + \frac{h^4}{24}\left(u^{(4)}(\xi_+) + u^{(4)}(\xi_-)\right),$$

which is the desired property.

Solution 8.4 The matrix is again tridiagonal with entries $a_{i,i-1} = -1 - h\frac{\delta}{2}$,
$a_{ii} = 2 + h^2\gamma$, $a_{i,i+1} = -1 + h\frac{\delta}{2}$. The right-hand side, accounting for the boundary
conditions, becomes $\mathbf{f} = (f(x_1) + \alpha(1 + h\delta/2)/h^2, f(x_2), \ldots, f(x_{N-1}), f(x_N) + \beta(1 - h\delta/2)/h^2)^T$.

Solution 8.5 With the following instructions we compute the solutions
corresponding to the three given values of h:

>> fbvp=inline('1+sin(4*pi*x)','x');

>> [z,uh10]=bvp(0,1,9,0,0.1,fbvp,0,0);

>> [z,uh20]=bvp(0,1,19,0,0.1,fbvp,0,0);

>> [z,uh40]=bvp(0,1,39,0,0.1,fbvp,0,0);

Since we don’t know the exact solution, to estimate the convergence order we
compute an approximate solution on a very fine grid (for instance h = 1/1000),
then we use this latter as a surrogate for the exact solution. We find:

>> [z,uhex]=bvp(0,1,999,0,0.1,fbvp,0,0);

>> max(abs(uh10-uhex(1:100:end)))
ans =

   8.6782e-04

>> max(abs(uh20-uhex(1:50:end)))
ans =

   2.0422e-04

>> max(abs(uh40-uhex(1:25:end)))
ans =

   5.2789e-05

Halving h, the error is divided by about 4, which indicates that the convergence
order with respect to h is 2.
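
The same test can be sketched without the program bvp, assembling directly the tridiagonal system described in Solution 8.4 (here multiplied by $h^2$). This is an illustration under that assumption, not a reproduction of bvp; the names Ns, U, err and p are introduced here.

```matlab
delta = 0; gamma = 0.1; alpha = 0; beta = 0;
f = @(x) 1 + sin(4*pi*x);
Ns = [9 19 39 999];                 % interior nodes; the last grid is the reference
U = cell(size(Ns));
for m = 1:numel(Ns)
    N = Ns(m); h = 1/(N+1); x = (1:N)'*h; e = ones(N,1);
    % tridiagonal matrix with the entries of Solution 8.4
    A = spdiags([(-1-h*delta/2)*e, (2+h^2*gamma)*e, (-1+h*delta/2)*e], -1:1, N, N);
    b = h^2*f(x);
    b(1)   = b(1)   + alpha*(1 + h*delta/2);   % contribution of u(0) = alpha
    b(end) = b(end) + beta *(1 - h*delta/2);   % contribution of u(1) = beta
    U{m} = [alpha; A\b; beta];
end
uex = U{end};                       % surrogate of the exact solution
err = zeros(1,3);
for m = 1:3
    step = 1000/(Ns(m)+1);          % coarse nodes are a subset of the fine ones
    err(m) = max(abs(U{m} - uex(1:step:end)));
end
err
p = log2(err(1:end-1)./err(2:end))  % estimated orders of convergence
```
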

Solution 8.6 To find the largest $h_{crit}$ which ensures a monotonic solution (as
the analytical one) we execute the following cycle:

>> fbvp=inline('1+0.*x','x'); for k=3:1000
[z,uh]=bvp(0,1,k,100,0,fbvp,0,1); if sum(diff(uh)>0)==length(uh)-1,
break, end, end

We let h (= 1/(k+1)) vary till the forward incremental ratios of the numerical
solution uh are all positive. To this end we compute the vector diff(uh)>0, whose
components are 1 if the corresponding incremental ratio is positive, 0 otherwise. If
the sum of all components equals the vector length of uh diminished by 1, then
all incremental ratios are positive. With $\delta = 100$, as in the instructions above,
the cycle stops when k=49, that is, when h = 1/50; it stops when
h = 1/500 if $\delta = 1000$, and when h = 1/1000 if $\delta = 2000$. We can therefore
guess that one should require $h < 2/\delta = h_{crit}$ in order to get a monotonically
increasing numerical solution. Indeed, this restriction on h is precisely what
can be proven theoretically (see, for instance, [QV94]). In Figure 9.14 we show
the numerical solutions obtained when $\delta = 100$ for two values of h.

Fig. 9.14. Numerical solution for Problem 8.6 obtained for h = 1/10 (dashed
line) and h = 1/60 (continuous line)
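
The cycle can be sketched in a self-contained form by assembling the centered finite difference system directly, under the assumption that it has the entries given in Solution 8.4 with $\gamma = 0$, $f = 1$, $u(0) = 0$ and $u(1) = 1$. It is an illustration, not the program bvp.

```matlab
delta = 100;
for k = 3:1000
    N = k; h = 1/(N+1); e = ones(N,1);
    A = spdiags([(-1-h*delta/2)*e, 2*e, (-1+h*delta/2)*e], -1:1, N, N);
    b = h^2*ones(N,1);                    % f = 1 and gamma = 0
    b(end) = b(end) + (1 - h*delta/2);    % boundary value u(1) = 1; u(0) = 0 adds nothing
    uh = [0; A\b; 1];
    if all(diff(uh) > 0), break, end      % stop at the first monotone solution
end
fprintf('cycle stops at k = %d, h = 1/%d, 2/delta = %g\n', k, k+1, 2/delta);
```

Changing delta to 1000 or 2000 allows one to repeat the experiment for the other values considered above.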



Solution 8.7 We should modify Program 17 in order to impose Neumann
boundary conditions. In Program 21 we show one possible implementation.



Program 21 - neumann: approximation of a Neumann boundary
problem

function [x,uh]=neumann(a,b,N,delta,gamma,bvpfun,ua,ub,varargin)
h = (b-a)/(N+1); x = [a:h:b]'; e = ones(N+2,1);
A = spdiags([-e-0.5*h*delta 2*e+gamma*h^2 -e+0.5*h*delta],-1:1,N+2,N+2);
f = h^2*feval(bvpfun,x',varargin{:}); f=f';
A(1,1)=-3/2*h; A(1,2)=2*h; A(1,3)=-1/2*h; f(1)=h^2*ua;
A(N+2,N+2)=3/2*h; A(N+2,N+1)=-2*h; A(N+2,N)=1/2*h; f(N+2)=h^2*ub;
uh = A\f;
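
A possible check of this approach is sketched below on a problem with a known solution. The test case is an assumption made here and is not taken from the exercise: with $\delta = 0$, $\gamma = 1$ and $f(x) = (\pi^2 + 1)\cos(\pi x)$ the solution with $u'(0) = u'(1) = 0$ is $u(x) = \cos(\pi x)$. The statements of Program 21 are repeated inline so that the script runs on its own.

```matlab
a = 0; b = 1; N = 99; delta = 0; gamma = 1; ua = 0; ub = 0;
bvpfun = @(x) (pi^2 + 1)*cos(pi*x);       % right-hand side for u(x) = cos(pi*x)
h = (b-a)/(N+1); x = linspace(a,b,N+2)'; e = ones(N+2,1);
A = spdiags([-e-0.5*h*delta, 2*e+gamma*h^2, -e+0.5*h*delta], -1:1, N+2, N+2);
f = h^2*bvpfun(x);
% one-sided second-order approximations of u'(a) = ua and u'(b) = ub
A(1,1) = -3/2*h;  A(1,2) = 2*h;  A(1,3) = -1/2*h;  f(1) = h^2*ua;
A(N+2,N+2) = 3/2*h;  A(N+2,N+1) = -2*h;  A(N+2,N) = 1/2*h;  f(N+2) = h^2*ub;
uh = A\f;
err = max(abs(uh - cos(pi*x)));
fprintf('maximum nodal error = %.2e\n', err);
```
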



Solution 8.8 The trapezoidal integration formula, used on the two subintervals
$I_{k-1}$ and $I_k$, produces the following approximation

$$\int_{I_{k-1} \cup I_k} f(x)\varphi_k(x)\,dx \simeq \frac{h}{2}f(x_k) + \frac{h}{2}f(x_k) = hf(x_k),$$

since $\varphi_k(x_j) = \delta_{jk}$, $\forall j, k$. Thus, we obtain the same right-hand side of the
finite difference method.

Solution 8.9 We have $\nabla\phi = (\partial\phi/\partial x, \partial\phi/\partial y)^T$ and therefore
$\mathrm{div}\,\nabla\phi = \partial^2\phi/\partial x^2 + \partial^2\phi/\partial y^2$, that is, the Laplacian of $\phi$.

Solution 8.10 To compute the temperature at the center of the plate, we
solve the corresponding Poisson problem for various values of $\Delta_x = \Delta_y$, using
the following instructions:

>> k=0; fun=inline('25','x','y'); bound=inline('(x==1)','x','y');

>> for N = [10,20,40,80,160],
[u,x,y]=poissonfd(0,0,1,1,N,N,fun,bound);
k=k+1; uc(k) = u(N/2+1,N/2+1); end

The components of the vector uc are the values of the computed temperature
at the center of the plate as the step-size h of the grid decreases. We have

>> uc

    2.0168 2.0616 2.0789 2.0859 2.0890

We can therefore conclude that at the center of the plate the temperature is
approximately 2.08 °C. In Figure 9.15 we show the level curves of the temperature
for two different values of h.
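
A self-contained sketch of the same computation with the standard five-point scheme is given below. It assumes the data used above (source term 25, boundary value 1 on the side x = 1 and 0 on the other sides) and a uniform grid with N intervals per direction. Since the details of poissonfd are not reproduced here, its output need not coincide digit by digit with the values of uc reported above.

```matlab
N = 20; h = 1/N; n = N-1;          % N intervals per direction, n interior nodes
e = ones(n,1);
T = spdiags([-e 2*e -e], -1:1, n, n);
I = speye(n);
A = kron(I,T) + kron(T,I);         % five-point scheme, multiplied by h^2
F = 25*h^2*ones(n,n);              % source term; F(i,j) refers to the node (x_i, y_j)
F(n,:) = F(n,:) + 1;               % boundary value 1 on the side x = 1
U = reshape(A\F(:), n, n);
uc = U(N/2, N/2);                  % value at the node (1/2, 1/2)
fprintf('temperature at the center, h = 1/%d: %.4f\n', N, uc);
```

The Kronecker products build the two-dimensional operator from the one-dimensional second difference matrix T, with the unknowns numbered in lexicographic order.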




Fig. 9.15. The level curves of the computed temperature for $\Delta_x = \Delta_y = 1/10$
(dashed lines) and for $\Delta_x = \Delta_y = 1/80$ (continuous lines)
